The purpose of this project is to predict the DC power generation with the data collected from two different power plants. This is a supervised learning regression task. Having this model/information will allow the user to predict the solar power generation based on expected weather data and to identify hardware issues quickly.
It is critical to predict the future power generation of renewables because our grid comprises many different generation sources, and some of those sources are difficult to start up and shut down, including nuclear, oil, and gas plants. Forecasting renewable generation provides the lead time required to start up other plants and cover any energy deficit.
The data that I have also includes information about the specific inverters associated with a bank of solar panels. Having this information will allow the user to identify when different hardware isn't performing up to specification. This will give them the opportunity to fix the hardware in a timely manner.
To make these predictions, I will use various supervised learning techniques, including linear regression, polynomial regression with feature engineering, and decision trees. I will compare the models and determine a recommended model for the user.
I am using a tabular dataset from Kaggle covering two solar generation plants[^1]. Each plant has an associated second dataset with relevant weather data from May 15, 2020 to June 15, 2020.
Plant 1 has 68778 rows and 7 columns. Plant 2 has 67698 rows and 7 columns. The columns are:
Weather 1 has 3182 rows and 6 columns. Weather 2 has 3259 rows and 6 columns. The columns are:
[^1]: Kannal, A. (2020). Solar Power Generation Data [data file]. Kaggle. https://www.kaggle.com/datasets/anikannal/solar-power-generation-data/download?datasetVersionNumber=1
I combined data cleaning and exploratory data analysis because as I was exploring the data, I discovered additional cleaning that needed to be done. Since it doesn't make sense to separate these steps in this case, I will discuss both in this section.
The following steps were performed to clean the data and ensure data integrity for the machine learning models:
## Libraries
import opendatasets as od
import joblib
import os
import json
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import IsolationForest
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error,r2_score
from itertools import combinations
model_path = 'models'
## Functions
## Filters out outliers based on the z-score
def filter_outliers(data, column, threshold=3):
"""Removes the outliers of a column in a dataframe based on the z-score set by the threshold.
Args:
data (dataframe): Dataframe to be filtered
column (string): Name of the column to be filtered
threshold (int, optional): z-score to filter the column on. Defaults to 3.
Returns:
dataframe: Filtered dataframe
"""
mean = data[column].mean()
std = data[column].std()
z_scores = (data[column] - mean) / std
return data[(z_scores < threshold) & (z_scores > -threshold)]
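As a quick sanity check, the filter can be exercised on synthetic data (a hypothetical sketch; the helper is reproduced so the snippet runs standalone):

```python
import pandas as pd

def filter_outliers(data, column, threshold=3):
    # Same logic as the project function: keep rows with |z-score| < threshold
    mean = data[column].mean()
    std = data[column].std()
    z_scores = (data[column] - mean) / std
    return data[(z_scores < threshold) & (z_scores > -threshold)]

# 99 typical values and one extreme outlier (hypothetical data)
df = pd.DataFrame({'ratio': [1.0] * 99 + [1000.0]})
filtered = filter_outliers(df, 'ratio')
print(len(df), len(filtered))  # → 100 99
```

The single extreme row sits roughly 10 standard deviations from the mean, so it is the only observation removed.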
## Calculate the difference between the current row and the previous row, grouped by the specified column and sorted by another specified column.
def add_diff_column(df, groupby_cols, diff_cols, diff_name):
"""
    This function sorts a pandas DataFrame by the groupby and diff columns, groups it by one or more columns,
    calculates the difference between the current row and the previous row within each group
    for one or more columns, and adds each difference as a new column to the DataFrame.
Parameters:
df (pd.DataFrame): The original DataFrame.
groupby_cols (list): The column(s) to group the data by.
diff_cols (list): The column(s) to calculate the difference for.
diff_name (str): The name to use for the difference column.
Returns:
df (pd.DataFrame): The original DataFrame with the added difference column.
"""
df = df.sort_values(by=groupby_cols+diff_cols)
grouped = df.groupby(groupby_cols)
for col in diff_cols:
df[f'{diff_name}_{col}'] = grouped[col].diff()
return df
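A small grouped example shows the intended behavior: the diff restarts at each group boundary (a hypothetical sketch; the helper is reproduced so the snippet runs standalone):

```python
import pandas as pd

def add_diff_column(df, groupby_cols, diff_cols, diff_name):
    # Same logic as the project function: sort, group, then row-over-row diff
    df = df.sort_values(by=groupby_cols + diff_cols)
    grouped = df.groupby(groupby_cols)
    for col in diff_cols:
        df[f'{diff_name}_{col}'] = grouped[col].diff()
    return df

# Two hypothetical inverters with two yield readings each
df = pd.DataFrame({'inverter': ['a', 'a', 'b', 'b'],
                   'yield': [10.0, 15.0, 7.0, 9.0]})
out = add_diff_column(df, ['inverter'], ['yield'], 'delta')
print(out['delta_yield'].tolist())  # → [nan, 5.0, nan, 2.0]
```

The first row of each group has no predecessor, so its difference is NaN rather than a value carried over from the other inverter.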
## Plot the data for the analysis of the outlier filter performance.
def plot_concat_df(concatDFbeforeFilter):
"""Plot the data for the analysis of the outlier filter performance.
The plots will include a histogram of the ratio of dc_power to irradiation,
a relational plot of dc_power and irradiation by plant_id,
and a line plot of dc_power and irradiation by inverter_id
Args:
concatDFbeforeFilter (pandas dataframe): dataframe that includes the plant performance data and the weather data.
Returns:
None
"""
concatDF = concatDFbeforeFilter[(concatDFbeforeFilter['dc_power']>0) & (concatDFbeforeFilter['dc_power_to_irradiation_ratio']<math.inf)]
plt.figure()
concatDF['dc_power_to_irradiation_ratio'].hist(bins = 50)
plt.title('Histogram of dc_power_to_irradiation_ratio')
plt.figure()
sns.relplot(concatDF,row = 'plant_id', y = 'dc_power',x = 'irradiation', hue = 'source_key',alpha = 0.1, legend = False)
plt.title('Relational plot of dc_power and irradiation by plant_id')
plt.figure()
sns.lineplot(data = concatDF, x = 'x_time_stamp',y = 'dc_power', hue = 'source_key', alpha = 0.1, legend = False)
ax2 = plt.gca().twinx()
sns.lineplot(data = concatDF, x = 'x_time_stamp',y = 'irradiation', ax = ax2, color = 'red')
plt.title('Line plot of dc_power and irradiation by inverter')
    for i in range(concatDF.day.min(), concatDF.day.max() + 1): #include the last day
df = concatDF[(concatDF['day'] == i) & (concatDF['plant_id'] == 4135001)]
plt.figure()
sns.scatterplot(data = df, x = 'x_time_stamp',y = 'dc_power', hue = 'source_key', alpha = 0.1, legend = False)
ax2 = plt.gca().twinx()
sns.lineplot(data = df, x = 'x_time_stamp',y = 'irradiation', ax = ax2, color = 'red')
plt.title(f'Scatter plot of dc_power and irradiation for day {i} and plant 4135001')
plt.show()
    for i in range(concatDF.day.min(), concatDF.day.max() + 1): #include the last day
df = concatDF[(concatDF['day'] == i) & (concatDF['plant_id'] == 4136001)]
plt.figure()
sns.scatterplot(data = df, x = 'x_time_stamp',y = 'dc_power', hue = 'source_key', alpha = 0.1, legend = False)
ax2 = plt.gca().twinx()
sns.lineplot(data = df, x = 'x_time_stamp',y = 'irradiation', ax = ax2, color = 'red')
plt.title(f'Scatter plot of dc_power and irradiation for day {i} and plant 4136001')
plt.show()
## Linear Regression using OLS and Forward Selection
def forward_selection(data, target, significance_level=0.05):
"""Performs a forward selection iteration to find the best features for a linear regression model.
Args:
data (dataframe): dataframe of the features to be used in the model
target (dataframe): dataframe or series of the target variable
        significance_level (float, optional): Significance threshold for the best candidate feature. If the p-value of the
            best feature for the iteration is not below this value, selection stops and that feature is excluded. Defaults to 0.05.
Returns:
tuple: (Dictionary of the best features and their p-values, statsmodels OLS model)
"""
initial_features = data.columns.tolist()
remaining_features = list(set(initial_features))
best_features = []
    return_dict = {}
    best_model = None  #Returned as None if no feature meets the significance level
    while remaining_features:
remaining_features = list(set(initial_features)-set(best_features))
new_pval = pd.Series(index=remaining_features, dtype='float64')
new_adjRsquared = pd.Series(index=remaining_features, dtype='float64')
model_list = {}
for new_column in remaining_features:
model = sm.OLS(target, sm.add_constant(
pd.DataFrame(data[best_features+[new_column]]))).fit()
new_pval[new_column] = model.pvalues[new_column]
new_adjRsquared[new_column] = model.rsquared_adj
model_list[new_column] = model
min_p_value = new_pval.min()
max_r_squared = new_adjRsquared.max()
# Add the largest r-squared value to the best features if the p-value is below the threshold
if new_pval[new_adjRsquared.idxmax()] < significance_level:
best_model = model_list[new_adjRsquared.idxmax()]
best_features.append(new_adjRsquared.idxmax())
return_dict[new_adjRsquared.idxmax()] = (
new_pval[new_adjRsquared.idxmax()], new_adjRsquared.max())
else:
break
return (return_dict, best_model)
## Linear Regression using OLS and Best Subset Selection
def subset_regression_ols(X,y,k): # sourcery skip: use-fstring-for-concatenation
    '''
    Performs best subset selection over all predictor combinations of size k.
    Input:
        X (pandas dataframe): dataframe of the predictors
        y (pandas series): series of the label
        k (integer): number of predictors in each subset
    Output:
        best_predictors (tuple): the best subset of predictors (length = k)
        best_adjustedRSquared (float): adjusted R squared of the best subset
        best_model (sm.OLS model): fitted model of the best subset
    '''
predictors_set = set(X.columns)
label = y.name
    best_predictors = None
    best_adjustedRSquared = 0
    best_model = None  #Returned as None if no subset beats an adjusted R squared of 0
df = pd.concat([X,pd.DataFrame(y)],axis = 1)
list_of_combinations = combinations(predictors_set,k)
for predictors in list_of_combinations:
pred_formula = "+".join(predictors)
model = ols(formula=label + "~" + pred_formula, data=df).fit()
if model.rsquared_adj>best_adjustedRSquared:
best_predictors = predictors
best_adjustedRSquared = model.rsquared_adj
best_model = model
return (best_predictors, best_adjustedRSquared, best_model)
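The exhaustive-search idea behind best subset selection can be illustrated with a hedged, self-contained sketch on synthetic data, using scikit-learn's `LinearRegression` in place of statsmodels (all variable names here are hypothetical):

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends only on columns 0 and 2
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 3 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

# Enumerate every size-2 subset and keep the best-scoring one
best = None
for subset in combinations(range(3), 2):
    cols = list(subset)
    model = LinearRegression().fit(X[:, cols], y)
    score = model.score(X[:, cols], y)  # R^2 on the training data
    if best is None or score > best[1]:
        best = (subset, score)
print(best[0])  # expect (0, 2), the columns that generated y
```

The project function does the same enumeration but scores each subset by adjusted R squared from a statsmodels OLS fit, which penalizes subset size.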
## Function to plot the actual vs predicted values and regression line for the model
def plot_actual_vs_predicted(model,X_test,y_test,x_feature):
"""Plot the actual vs predicted values and regression line for the model passed to the function
Args:
model (statsmodels.ols): Statsmodels OLS model
X_test (dataframe): pandas dataframe of the test data including a constant column
y_test (series): pandas series of the test labels
x_feature (string): column name of the feature to be used for the x-axis of the regression line and predictor values
"""
#Domain Values for the regression line
x = np.linspace(X_test[x_feature].min(),X_test[x_feature].max(),2)
    #Get line based on the intercept (if present) and the coefficient for the feature
    y = model.params.get('const', 0) + model.params[x_feature]*x
#Scatter plot of actual vs predicted values
plt.figure(figsize=(10,6))
plt.scatter(X_test[x_feature], y_test, color='blue', label='Actual', alpha = 0.1)
plt.scatter(X_test[x_feature], model.predict(X_test[model.model.exog_names[:]]), color='red', label='Predicted', alpha = 0.025)
#Plot Regression Line
plt.plot(x,y, color = 'green', label = 'Regression Line')
plt.title('Actual vs Predicted with the Forward Selection Model')
plt.xlabel(x_feature)
plt.ylabel('DC Power')
plt.legend()
plt.show()
def polynomial_features(X_train, X_test, degree):
"""Returns a dataframe of the polynomial features for the degree passed to the function
Args:
X_train (dataframe): dataframe of the train data that needs to be transformed into polynomial features
X_test (dataframe): dataframe of the test data that needs to be transformed into polynomial features
degree (integer): degree of the polynomial features
Returns:
tuple: dataframes of the transformed train and test data
"""
    poly = PolynomialFeatures(degree)
    #Fit on the train data only, then apply the same transformation to the test data
    X_train_poly = pd.DataFrame(poly.fit_transform(X_train))
    X_test_poly = pd.DataFrame(poly.transform(X_test))
    return X_train_poly, X_test_poly
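One detail worth noting is the fit/transform split: the transformer learns its feature combinations on the training data and then only applies them to the test data. A minimal sketch with hypothetical values:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)  # learns the feature combinations
X_test_poly = poly.transform(X_test)        # reuses them on unseen data
print(X_train_poly.shape, X_test_poly.shape)  # → (3, 3) (1, 3)
```

For a single input column and degree 2, each row expands to the three columns [1, x, x²], so the test row [4.0] becomes [1.0, 4.0, 16.0].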
def variance_inflation_factor(df, feature_name, print_summary = False):
    """Calculates the variance inflation factor (VIF) for one explanatory variable by regressing it
    on the remaining features and applying VIF = 1 / (1 - R^2).
    Args:
        df (pandas DataFrame): data with each column representing an explanatory variable
        feature_name (str): name of the feature for which the VIF should be calculated
        print_summary (bool, optional): if True, print the OLS summary for the feature's regression. Defaults to False.
    Returns:
        float: VIF for the feature_name variable
    """
    x_i = df[feature_name]
    x_noti = df.drop(feature_name, axis=1).astype('float64')
model = sm.OLS(x_i, x_noti).fit()
r_squared_i = model.rsquared
if print_summary:
print(model.summary())
vif = 1. / (1. - r_squared_i)
return vif
plant1 = pd.read_csv('solar-power-generation-data/Plant_1_Generation_Data.csv')
plant2 = pd.read_csv('solar-power-generation-data/Plant_2_Generation_Data.csv')
weather1 = pd.read_csv('solar-power-generation-data/Plant_1_Weather_Sensor_Data.csv')
weather2 = pd.read_csv('solar-power-generation-data/Plant_2_Weather_Sensor_Data.csv')
#Make all column names lowercase and easier to work with
plant1.columns = plant1.columns.str.lower()
plant2.columns = plant2.columns.str.lower()
weather1.columns = weather1.columns.str.lower()
weather2.columns = weather2.columns.str.lower()
#Set Correct Data Types
plant1['date_time'] = pd.to_datetime(plant1['date_time'])
plant2['date_time'] = pd.to_datetime(plant2['date_time'])
weather1['date_time'] = pd.to_datetime(weather1['date_time'])
weather2['date_time'] = pd.to_datetime(weather2['date_time'])
plant1['source_key'] = plant1['source_key'].astype('category')
plant2['source_key'] = plant2['source_key'].astype('category')
weather1['source_key'] = weather1['source_key'].astype('category')
weather2['source_key'] = weather2['source_key'].astype('category')
I used the code below to determine the number of missing values in each column and to verify the data types.
When I ran the describe method, I noticed that the dc_power for plant 1 was roughly 10 times higher than for plant 2, while the ac_power was of the same magnitude for both plants. I determined that this was a decimal-scale issue and corrected it by dividing plant 1's dc_power by 10. This is done in the section "Correcting Plant 1's dc_power scale issue".
#Understanding null values and data types
print(plant1.info())
print(plant2.info())
print(weather1.info())
print(weather2.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68778 entries, 0 to 68777
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   date_time    68778 non-null  datetime64[ns]
 1   plant_id     68778 non-null  int64
 2   source_key   68778 non-null  category
 3   dc_power     68778 non-null  float64
 4   ac_power     68778 non-null  float64
 5   daily_yield  68778 non-null  float64
 6   total_yield  68778 non-null  float64
dtypes: category(1), datetime64[ns](1), float64(4), int64(1)
memory usage: 3.2 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67698 entries, 0 to 67697
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   date_time    67698 non-null  datetime64[ns]
 1   plant_id     67698 non-null  int64
 2   source_key   67698 non-null  category
 3   dc_power     67698 non-null  float64
 4   ac_power     67698 non-null  float64
 5   daily_yield  67698 non-null  float64
 6   total_yield  67698 non-null  float64
dtypes: category(1), datetime64[ns](1), float64(4), int64(1)
memory usage: 3.2 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3182 entries, 0 to 3181
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   date_time            3182 non-null   datetime64[ns]
 1   plant_id             3182 non-null   int64
 2   source_key           3182 non-null   category
 3   ambient_temperature  3182 non-null   float64
 4   module_temperature   3182 non-null   float64
 5   irradiation          3182 non-null   float64
dtypes: category(1), datetime64[ns](1), float64(3), int64(1)
memory usage: 127.6 KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3259 entries, 0 to 3258
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   date_time            3259 non-null   datetime64[ns]
 1   plant_id             3259 non-null   int64
 2   source_key           3259 non-null   category
 3   ambient_temperature  3259 non-null   float64
 4   module_temperature   3259 non-null   float64
 5   irradiation          3259 non-null   float64
dtypes: category(1), datetime64[ns](1), float64(3), int64(1)
memory usage: 130.7 KB
None
#Understanding statistics of the data
print(plant1.describe())
print(plant2.describe())
print(weather1.describe())
print(weather2.describe())
plant_id dc_power ac_power daily_yield total_yield
count 68778.0 68778.000000 68778.000000 68778.000000 6.877800e+04
mean 4135001.0 3147.426211 307.802752 3295.968737 6.978712e+06
std 0.0 4036.457169 394.396439 3145.178309 4.162720e+05
min 4135001.0 0.000000 0.000000 0.000000 6.183645e+06
25% 4135001.0 0.000000 0.000000 0.000000 6.512003e+06
50% 4135001.0 429.000000 41.493750 2658.714286 7.146685e+06
75% 4135001.0 6366.964286 623.618750 6274.000000 7.268706e+06
max 4135001.0 14471.125000 1410.950000 9163.000000 7.846821e+06
plant_id dc_power ac_power daily_yield total_yield
count 67698.0 67698.000000 67698.000000 67698.000000 6.769800e+04
mean 4136001.0 246.701961 241.277825 3294.890295 6.589448e+08
std 0.0 370.569597 362.112118 2919.448386 7.296678e+08
min 4136001.0 0.000000 0.000000 0.000000 0.000000e+00
25% 4136001.0 0.000000 0.000000 272.750000 1.996494e+07
50% 4136001.0 0.000000 0.000000 2911.000000 2.826276e+08
75% 4136001.0 446.591667 438.215000 5534.000000 1.348495e+09
max 4136001.0 1420.933333 1385.420000 9873.000000 2.247916e+09
plant_id ambient_temperature module_temperature irradiation
count 3182.0 3182.000000 3182.000000 3182.000000
mean 4135001.0 25.531606 31.091015 0.228313
std 0.0 3.354856 12.261222 0.300836
min 4135001.0 20.398505 18.140415 0.000000
25% 4135001.0 22.705182 21.090553 0.000000
50% 4135001.0 24.613814 24.618060 0.024653
75% 4135001.0 27.920532 41.307840 0.449588
max 4135001.0 35.252486 65.545714 1.221652
plant_id ambient_temperature module_temperature irradiation
count 3259.0 3259.000000 3259.000000 3259.000000
mean 4136001.0 28.069400 32.772408 0.232737
std 0.0 4.061556 11.344034 0.312693
min 4136001.0 20.942385 20.265123 0.000000
25% 4136001.0 24.602135 23.716881 0.000000
50% 4136001.0 26.981263 27.534606 0.019040
75% 4136001.0 31.056757 40.480653 0.438717
max 4136001.0 39.181638 66.635953 1.098766
Below are histograms of the dc_power for both plants to verify that, after rescaling plant 1's dc_power, the distributions are as expected and no further issues remain.
#Histogram of DC Power for both plants to ensure the scales are the same.
print('Verify the scales of the DC Power are incorrect for Plant 1')
plant1['dc_power'].hist(alpha = 0.5, label = 'Plant 1')
plant2['dc_power'].hist(alpha = 0.5, label = 'Plant 2')
plt.title('DC Power Histogram for Plant 1 and Plant 2')
plt.show()
print(f'Average DC power for plant 1 before correction: {plant1.dc_power.mean().round(2)}')
print(f'Average DC power for plant 2 before correction: {plant2.dc_power.mean().round(2)}')
#Decimal Error in Plant 1's dc power
plant1['dc_power'] = plant1['dc_power']/10
print(f'Average DC power for plant 1 after correction: {plant1.dc_power.mean().round(2)}')
print(f'Average DC power for plant 2 after correction: {plant2.dc_power.mean().round(2)}')
print('Verify the scales of the DC Power are corrected for Plant 1')
plant1['dc_power'].hist(alpha = 0.5, label = 'Plant 1')
plant2['dc_power'].hist(alpha = 0.5, label = 'Plant 2')
plt.title('DC Power Histogram for Plant 1 and Plant 2')
plt.show()
Verify the scales of the DC Power are incorrect for Plant 1
Average DC power for plant 1 before correction: 3147.43
Average DC power for plant 2 before correction: 246.7
Average DC power for plant 1 after correction: 314.74
Average DC power for plant 2 after correction: 246.7
Verify the scales of the DC Power are corrected for Plant 1
Since the statistics don't reveal any additional issues between the data sources, I will merge the dataframes and continue the exploratory data analysis on the merged dataframe, including the weather data. The power generation data and the weather data were both sampled at 15-minute intervals, so I will match those timestamps to ensure the data is aligned.
In this section, I also created a few calculated columns to help with the EDA portion of the project: the month, day, hour, and x_time_stamp of each observation, plus the ratio of dc_power to irradiation. The ratio column will help me remove the outliers caused by the plant being shut down due to overpowering the grid.
#Merge power generation and weather data
plant1Merged = plant1.merge(weather1, on='date_time', how = 'inner')
plant2Merged = plant2.merge(weather2, on='date_time', how = 'inner')
#Remove unnecessary columns
plant1Merged.drop(columns = 'source_key_y', axis = 1, inplace = True) #only one weather station per plant
plant2Merged.drop(columns = 'source_key_y', axis = 1, inplace = True) #only one weather station per plant
plant1Merged.drop(columns = 'plant_id_y', axis = 1, inplace = True) #duplicate
plant2Merged.drop(columns = 'plant_id_y', axis = 1, inplace = True) #duplicate
#Rename columns
plant1Merged.rename(columns = {'plant_id_x':'plant_id','source_key_x':'source_key'}, inplace = True)
plant2Merged.rename(columns = {'plant_id_x':'plant_id','source_key_x':'source_key'}, inplace = True)
#Concatenate the two dataframes
concatDFbeforeFilter = pd.concat([plant1Merged, plant2Merged])
concatDFbeforeFilter.reset_index(drop=True, inplace=True)
#Create new columns for future analysis
concatDFbeforeFilter['day'] = concatDFbeforeFilter['date_time'].dt.day_of_year
concatDFbeforeFilter['month'] = concatDFbeforeFilter['date_time'].dt.month
concatDFbeforeFilter['hour'] = concatDFbeforeFilter['date_time'].dt.hour
concatDFbeforeFilter['minute'] = concatDFbeforeFilter['date_time'].dt.minute/60
concatDFbeforeFilter['x_time_stamp'] = concatDFbeforeFilter['hour']+concatDFbeforeFilter['minute']
concatDFbeforeFilter['dc_power_to_irradiation_ratio'] = concatDFbeforeFilter['dc_power']/concatDFbeforeFilter['irradiation'] #This was added since this feature predicts when the plant was not producing power.
#Get a sample of the data to see what it looks like
concatDFbeforeFilter.sample(5)
| | date_time | plant_id | source_key | dc_power | ac_power | daily_yield | total_yield | ambient_temperature | module_temperature | irradiation | day | month | hour | minute | x_time_stamp | dc_power_to_irradiation_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 58331 | 2020-05-21 09:30:00 | 4136001 | WcxssY2VbP4hApt | 949.526667 | 928.640000 | 1536.733333 | 1.817458e+08 | 29.269145 | 44.305649 | 0.678798 | 142 | 5 | 9 | 0.5 | 9.5 | 1398.835719 |
| 5211 | 2020-05-17 15:00:00 | 4135001 | 7JYdWkrLSPkdwr4 | 907.587500 | 887.737500 | 6332.500000 | 7.621870e+06 | 35.109772 | 54.054820 | 0.683636 | 138 | 5 | 15 | 0.0 | 15.0 | 1327.588414 |
| 96503 | 2020-06-10 00:00:00 | 4136001 | xoJJ8DcxJEcupym | 0.000000 | 0.000000 | 6993.000000 | 2.092928e+08 | 25.437388 | 24.202458 | 0.000000 | 162 | 6 | 0 | 0.0 | 0.0 | NaN |
| 16110 | 2020-05-23 12:00:00 | 4135001 | sjndEbLyjtCKgGv | 1248.328571 | 1217.285714 | 3992.571429 | 7.076955e+06 | 30.610838 | 58.550425 | 0.962423 | 144 | 5 | 12 | 0.0 | 12.0 | 1297.068118 |
| 96232 | 2020-06-09 21:00:00 | 4136001 | mqwcsP2rE7J0TFp | 0.000000 | 0.000000 | 8040.000000 | 5.937757e+08 | 27.385647 | 26.474708 | 0.000000 | 161 | 6 | 21 | 0.0 | 21.0 | NaN |
I will use the correlation matrix to determine which features are most correlated with the target variable, and which features are highly correlated with each other.
I will use my understanding of the process to remove several columns from the dataframe that are dependent on dc_power. This will include the ac_power, daily_yield, and total_yield columns. I will also remove the hour and minute columns since x_time_stamp is a calculation based off of those two columns.
In the correlation matrix below, the feature that is most correlated with the dc_power is irradiation. This makes sense because the intent of a solar panel is to convert radiation energy into electrical energy.
To further understand the correlation between the dc_power and irradiation, I will plot the data to see if there are any outliers. Outliers in this data imply that the plant was shut down due to overpowering the grid and should be removed from the analysis. I will compare the data before the outliers were removed to the data after the outliers were removed.
## Create a heatmap of the correlation between the columns
plt.figure(figsize = (20,20))
g = sns.heatmap(concatDFbeforeFilter.corr(numeric_only=True), annot = True, cmap = 'coolwarm', vmin = -1, vmax = 1)
g.set_title('Correlation Between the Columns', fontsize = 20)
#Scatter plot of DC Power vs Irradiation to determine if there are outliers that should be removed
sns.scatterplot(x = 'irradiation', y = 'dc_power', data = concatDFbeforeFilter, hue = 'plant_id', alpha = 0.05, palette = 'coolwarm')
plt.title('DC_power vs irradiation', fontsize = 20)
##Pair plot to determine if there are any other obvious data quality issues
g = sns.pairplot(concatDFbeforeFilter[['dc_power','ac_power','irradiation','ambient_temperature','module_temperature','dc_power_to_irradiation_ratio','x_time_stamp']], plot_kws={'alpha': 0.05})
g.fig.suptitle('Pair Plot of the Columns', fontsize = 20)
In the above analysis, we determined that there were significant outliers in the ratio of dc_power to irradiation. These occur when the grid cannot absorb the power generated by the solar panels, which requires the panels to be shut down.
Since that data isn't relevant to predicting potential power, I will remove these outliers from the dataset. This includes removing the rows where the dc_power is 0 and the rows adjacent to them: because the data is sampled at 15-minute intervals, I cannot tell whether an inverter was shut off for the entire interval or only part of it, so dropping the adjacent rows ensures that no observation covers an interval where the inverter was only partially active.
I will also remove rows that are outliers in the ratio of dc_power to irradiation, since those also imply that the solar panels were shut down.
After removing the outliers, I plotted the data again to confirm that no other issues remained.
#Remove the rows adjacent to the dc_power = 0 rows to ensure all partial samples are removed and remove the impossible ratio values
concatDFbeforeFilter.sort_values(by = ['source_key','date_time'], ascending = [True,True], inplace = True)
#Shift within each inverter so values don't bleed across source_key boundaries
concatDFbeforeFilter["dc_power_shifted_positive"] = concatDFbeforeFilter.groupby("source_key")["dc_power"].shift(-1)
concatDFbeforeFilter["dc_power_shifted_negative"] = concatDFbeforeFilter.groupby("source_key")["dc_power"].shift(1)
concatDFFilterAdjacent = concatDFbeforeFilter[(concatDFbeforeFilter['dc_power']>0) &
(concatDFbeforeFilter['dc_power_shifted_positive']>0) &
(concatDFbeforeFilter['dc_power_shifted_negative'] > 0) &
(concatDFbeforeFilter['dc_power_to_irradiation_ratio']< math.inf)]
##Data without outliers and rename for clarity in future models
data = filter_outliers(concatDFFilterAdjacent,'dc_power_to_irradiation_ratio', threshold = 3)
#Scatter plot of DC Power vs Irradiation before the outlier removal
sns.scatterplot(x = 'irradiation', y = 'dc_power', data = concatDFbeforeFilter, hue = 'plant_id', alpha = 0.05, palette = 'coolwarm')
plt.title('DC_power vs Irradiation before outlier removal', fontsize = 20)
plt.show()
#Scatter plot of DC Power vs Irradiation after the outlier removal
sns.scatterplot(x = 'irradiation', y = 'dc_power', data = data, hue = 'plant_id', alpha = 0.05, palette = 'coolwarm')
plt.title('DC_power vs Irradiation after outlier removal', fontsize = 20)
plt.show()
data.columns
Index(['date_time', 'plant_id', 'source_key', 'dc_power', 'ac_power',
'daily_yield', 'total_yield', 'ambient_temperature',
'module_temperature', 'irradiation', 'day', 'month', 'hour', 'minute',
'x_time_stamp', 'dc_power_to_irradiation_ratio',
'dc_power_shifted_positive', 'dc_power_shifted_negative'],
dtype='object')
##Pair plot to determine if there are any other obvious data quality issues after the removal of the outliers in dc_power_to_irradiation_ratio
g = sns.pairplot(data[['dc_power','ac_power','irradiation','ambient_temperature','module_temperature','dc_power_to_irradiation_ratio','x_time_stamp']], plot_kws={'alpha': 0.05})
g.fig.suptitle('Pair Plot of the Columns', fontsize = 20)
Since the x_time_stamp (hours from midnight) isn't linear with the dc_power, I will transform the x_time_stamp into a polynomial feature after I run the model without the polynomial features. This will be completed in the section titled "Polynomial Features/Polynomial Regression".
#Plot the dc_power and irradiation to understand the relationship between the x_time_stamp
sns.lineplot(data = data, x = 'x_time_stamp',y = 'dc_power', hue = 'source_key', alpha = 0.1, legend = False)
ax2 = plt.gca().twinx()
sns.lineplot(data = data, x = 'x_time_stamp',y = 'irradiation', ax = ax2, color = 'red')
Upon analysis of the collinearity it was determined that all of the source_keys and plant_ids were highly correlated with each other. This is expected since the source_key is the serial number of the inverter and the plant_id is the plant where the inverter is located. However, since part of the problem that I am attempting to solve is understanding when hardware starts to fail, I will keep the source_key and plant_id in the dataset. I will use other techniques to determine what features aren't significant in the model.
#Compute the VIF for the features
import warnings
warnings.filterwarnings('ignore') #There are several perfect multicollinearity warnings that can be ignored
vif_data = data[['plant_id','source_key','irradiation','ambient_temperature','module_temperature','day','month','x_time_stamp']]
vif_data = pd.get_dummies(vif_data, columns = ['plant_id','source_key'])
vif = pd.DataFrame()
features_list = list(vif_data.columns)
vif['Features'] = features_list
## Drop nan and inf values
vif_data = vif_data[features_list].replace([np.inf, -np.inf], np.nan).dropna()
## Compute the VIF on the cleaned data
vif['VIF'] = [variance_inflation_factor(vif_data, feature_name = feature, print_summary = False) for feature in features_list]
## Sort the VIF values and report their VIF values
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = True)
#Print the VIF values for features that are not infinite. All the source_key features are infinite
vif[vif['VIF']<np.inf]
| | Features | VIF |
|---|---|---|
| 5 | x_time_stamp | 2.80 |
| 3 | day | 4.39 |
| 4 | month | 4.97 |
| 1 | ambient_temperature | 8.84 |
| 0 | irradiation | 12.01 |
| 2 | module_temperature | 20.99 |
The biggest difficulty I faced in the data cleaning and EDA portion of this report was finding the best way to remove the outliers caused by the plant being taken offline during periods of excessive power production. I chose to remove the rows where the dc_power was 0 along with the rows adjacent to them, and I did some feature engineering by creating the dc_power_to_irradiation_ratio. These decisions were based on domain knowledge and the scatter plots that highlighted the outliers in the data. Once I removed the outliers, I ran a quick test by fitting a linear regression model: the data without the outliers had an r-squared above 0.978, while the data with the outliers had an r-squared of 0.767. This shows that the outliers were having a significant impact on the model's performance, as will be shown in the "Linear Regression" section of this report.
I also discovered the expected collinearity with the source_key for the inverters; however, I chose to keep these features in the model to provide a baseline for future expectations of performance. This is necessary if the model is expected to provide information on hardware failure. Although irradiation is the primary predictor, the inverter serial numbers were shown to still be statistically significant for the model's performance, as shown in the "Linear Regression" section of this report.
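The zero-power removal described above can be sketched as follows. The column names mirror the report's dc_power_shifted_positive and dc_power_shifted_negative features, which are assumed here to hold the neighboring readings within an inverter's time series; the data values are illustrative only:

```python
import pandas as pd

# Toy time series for a single inverter; a 0 reading marks an offline period.
df = pd.DataFrame({'dc_power': [120.0, 0.0, 95.0, 110.0, 115.0]})

# Neighboring readings: the previous and next dc_power values.
df['dc_power_shifted_positive'] = df['dc_power'].shift(1)
df['dc_power_shifted_negative'] = df['dc_power'].shift(-1)

# Drop the rows where dc_power is 0 and the rows adjacent to them.
mask = (
    (df['dc_power'] == 0)
    | (df['dc_power_shifted_positive'] == 0)
    | (df['dc_power_shifted_negative'] == 0)
)
cleaned = df.loc[~mask]
print(cleaned['dc_power'].tolist())  # [110.0, 115.0]
```

Dropping the adjacent rows as well guards against partial intervals where the plant was ramping down into or up out of an offline period.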
I will use multiple models to try to predict the dc_power. I will use the following models:
- Linear regression (OLS with forward feature selection)
- Polynomial regression with feature engineering
- Decision trees
To evaluate the models, I will print the r-squared score of each model in addition to plotting the residuals. I chose the R^2 metric because it measures how well future samples are likely to be predicted by the model. Once a model is fit, I will use the test data to evaluate its ability to generalize to unseen data and to ensure that it isn't over-fitting. In the results section, I will compare the models to each other and determine which one best fits the data.
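As a small, self-contained illustration of this evaluation step (the actual and predicted values here are hypothetical):

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted dc_power values.
y_test = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 395.0])

# R^2 compares the residual sum of squares to the total sum of squares.
print(f'Rsquared Score: {r2_score(y_test, y_pred):.4f}')  # 0.9935

# The residuals (actual - predicted) are what gets plotted to check for
# structure in the errors, e.g. plt.scatter(y_pred, y_test - y_pred).
residuals = y_test - y_pred
```

A residual plot with no visible pattern supports the linearity assumption; curvature or fanning would suggest a transformation or a different model class is needed.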
#Model location filepath
model_path = 'models'
#Build X and y datasets without outliers. Drop the columns that are not needed for the
#model because they are either derived from the target or collinear with other features (e.g. hour and minute with x_time_stamp)
X = data.drop(columns = ['dc_power','date_time','plant_id','ac_power','daily_yield','total_yield',
'hour','minute','dc_power_to_irradiation_ratio','dc_power_shifted_positive','dc_power_shifted_negative'])
# One hot encoding for categorical variable source_key (inverter serial number)
X = pd.get_dummies(X, columns = ['source_key'])
#Target variable
y = data['dc_power']
#Split the data into train and test
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2, random_state = 42)
#Add constant for statsmodels
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)
##Data with outliers
data_with_outliers = concatDFbeforeFilter.copy() #Copy so the drop below doesn't mutate the original dataframe
data_with_outliers.drop(columns=['date_time','plant_id','ac_power','daily_yield','total_yield',
'hour','minute','dc_power_to_irradiation_ratio','dc_power_shifted_positive','dc_power_shifted_negative'], inplace = True)
X_outliers = data_with_outliers.drop(columns = ['dc_power'])
X_outliers = pd.get_dummies(X_outliers, columns = ['source_key'])
y_outliers = data_with_outliers['dc_power']
X_train_outliers,X_test_outliers,y_train_outliers,y_test_outliers = train_test_split(X_outliers,y_outliers,test_size = 0.2, random_state = 42)
X_train_outliers = sm.add_constant(X_train_outliers)
X_test_outliers = sm.add_constant(X_test_outliers)
Summary of the linear regression models:
##Select best features from the data without outliers
best_features, ols_model = forward_selection(X_train,y_train)
#Save the model
joblib.dump(ols_model, os.path.join(model_path, 'ols_model.pkl'))
##Print the information about the best features
print("Feature, p-value, adjusted R^2:\n")
for key, value in best_features.items():
print(key, value)
Feature, p-value, adjusted R^2:
irradiation (0.0, 0.9746659792698716)
source_key_Quc1TzYxW2pYoWX (0.0, 0.9757064068251208)
month (3.6498935499972564e-186, 0.9761912815681737)
ambient_temperature (6.062068956151132e-293, 0.9769381545533012)
source_key_bvBOhCH3iADSZry (1.4828508644945337e-116, 0.9772253506442744)
source_key_Et9kgGMDl729KT4 (1.6713623129998394e-109, 0.9774915921034207)
source_key_1BY6WEcLGh8j5v7 (9.447278930559111e-95, 0.9777187486072327)
source_key_rrq4fwE8jgrTyWY (1.046643227579298e-43, 0.9778200852707739)
source_key_LYwnQax7tkwH5Cb (1.6860481910725233e-25, 0.9778770800641564)
module_temperature (1.2943172942187097e-24, 0.9779318046919271)
source_key_adLQvlD726eNBSB (3.6910566543855494e-20, 0.9779757354280458)
source_key_oZ35aAeoifZaQzV (6.636954369781763e-19, 0.9780165869743764)
source_key_4UPUqMRk7TRMgml (3.110022101249188e-20, 0.978060528172972)
source_key_ih0vzX44oOqAx2f (8.039942606867147e-15, 0.9780915399718374)
source_key_PeE6FRyGXUgsRhN (4.804455730066444e-15, 0.9781230374056196)
source_key_q49J1IKaHRwDQnt (5.313406947709377e-16, 0.9781567495490511)
source_key_Mx2yZCDsyf6DPfv (9.617595444599415e-12, 0.978180386004802)
day (1.2111430225198995e-09, 0.9781990835302574)
source_key_Qf4GUc1pJu5T6c6 (1.17168346360562e-09, 0.9782177990248833)
source_key_1IF53ai7Xc0U56Y (3.209829969186449e-09, 0.9782354797282728)
source_key_VHMLBKoKgIrUVDU (9.260926876919753e-09, 0.978252077278099)
source_key_McdE0feGgRqW7Ca (1.7586849015451918e-08, 0.978268016685517)
source_key_IQ2d7wF4YD8zU1Q (3.1419042934002934e-08, 0.9782833614214528)
source_key_3PZuoBAID5Wc2HD (2.6901372823861577e-07, 0.9782965438131855)
source_key_WcxssY2VbP4hApt (1.4065139697891365e-06, 0.9783080690146889)
x_time_stamp (3.926302579976278e-07, 0.9783208595651998)
source_key_oZZkBaNadn6DNKz (3.6392950615195897e-06, 0.9783314288747313)
source_key_vOuJvMaM2sgwLmb (8.486647800967697e-07, 0.9783434386405148)
source_key_V94E5Ben1TlhnDV (0.00047867307668236206, 0.9783492216583867)
source_key_ZoEaEvLYb1n2sOq (0.0009466337115072633, 0.978354348700926)
source_key_zBIq5rxdHJRwDNY (0.000987660738337423, 0.9783594340868867)
source_key_LlT2YUhhzqhg5Sw (0.0015670062929069625, 0.9783640784639696)
source_key_NgDl19wMapZy17u (0.005121876105121559, 0.9783676062329125)
source_key_wCURE6d3bPkepu2 (0.00779896374000723, 0.9783707430212899)
source_key_iCRJl6heRkivqQ3 (0.007824096444038852, 0.9783738764554692)
source_key_mqwcsP2rE7J0TFp (0.01075350939294249, 0.9783767166703227)
source_key_xMbIugepa2P7lBB (0.004711428432185575, 0.9783803206523747)
source_key_ZnxXDlPa8U1GXgE (0.011228922342866467, 0.9783831205089173)
source_key_uHbuxQJl8lW7ozc (0.026515584931637, 0.9783851431417571)
source_key_zVJPv84UY57bAof (0.04878212895730191, 0.9783866297497644)
#Print a summary of the OLS model
ols_model = joblib.load(os.path.join(model_path, 'ols_model.pkl'))
print(ols_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: dc_power R-squared: 0.978
Model: OLS Adj. R-squared: 0.978
Method: Least Squares F-statistic: 4.748e+04
Date: Mon, 27 Feb 2023 Prob (F-statistic): 0.00
Time: 11:46:51 Log-Likelihood: -2.2805e+05
No. Observations: 41956 AIC: 4.562e+05
Df Residuals: 41915 BIC: 4.565e+05
Df Model: 40
Covariance Type: nonrobust
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
const -170.8368 6.439 -26.532 0.000 -183.457 -158.216
irradiation 1322.9859 3.254 406.606 0.000 1316.609 1329.363
source_key_Quc1TzYxW2pYoWX -84.2079 1.905 -44.203 0.000 -87.942 -80.474
month 13.4131 1.207 11.114 0.000 11.048 15.779
ambient_temperature 3.3613 0.191 17.561 0.000 2.986 3.736
source_key_bvBOhCH3iADSZry -42.9078 2.020 -21.247 0.000 -46.866 -38.949
source_key_Et9kgGMDl729KT4 -37.1498 1.913 -19.419 0.000 -40.899 -33.400
source_key_1BY6WEcLGh8j5v7 -35.9238 2.014 -17.833 0.000 -39.872 -31.975
source_key_rrq4fwE8jgrTyWY -18.8427 1.844 -10.216 0.000 -22.458 -15.228
source_key_LYwnQax7tkwH5Cb -12.5030 1.877 -6.662 0.000 -16.182 -8.824
module_temperature -0.9485 0.111 -8.537 0.000 -1.166 -0.731
source_key_adLQvlD726eNBSB 21.9749 1.998 10.996 0.000 18.058 25.892
source_key_oZ35aAeoifZaQzV 21.7787 1.771 12.299 0.000 18.308 25.249
source_key_4UPUqMRk7TRMgml 21.6723 1.776 12.205 0.000 18.192 25.153
source_key_ih0vzX44oOqAx2f -11.3601 2.009 -5.654 0.000 -15.298 -7.422
source_key_PeE6FRyGXUgsRhN -6.4632 1.786 -3.619 0.000 -9.964 -2.963
source_key_q49J1IKaHRwDQnt -6.7150 1.820 -3.689 0.000 -10.283 -3.147
source_key_Mx2yZCDsyf6DPfv 18.4344 1.763 10.455 0.000 14.978 21.890
day 0.3278 0.055 5.968 0.000 0.220 0.435
source_key_Qf4GUc1pJu5T6c6 16.5581 1.764 9.387 0.000 13.101 20.015
source_key_1IF53ai7Xc0U56Y 15.1059 1.985 7.608 0.000 11.214 18.997
source_key_VHMLBKoKgIrUVDU 14.3113 1.993 7.181 0.000 10.405 18.218
source_key_McdE0feGgRqW7Ca 13.6534 1.984 6.881 0.000 9.765 17.542
source_key_IQ2d7wF4YD8zU1Q 15.4826 1.965 7.879 0.000 11.631 19.334
source_key_3PZuoBAID5Wc2HD 12.1412 2.002 6.065 0.000 8.217 16.065
source_key_WcxssY2VbP4hApt 12.8624 1.854 6.939 0.000 9.229 16.495
x_time_stamp 0.7452 0.120 6.216 0.000 0.510 0.980
source_key_oZZkBaNadn6DNKz 10.9314 1.768 6.182 0.000 7.466 14.397
source_key_vOuJvMaM2sgwLmb 10.9164 1.793 6.089 0.000 7.402 14.430
source_key_V94E5Ben1TlhnDV 7.8378 1.767 4.434 0.000 4.374 11.302
source_key_ZoEaEvLYb1n2sOq -4.5188 1.990 -2.271 0.023 -8.419 -0.618
source_key_zBIq5rxdHJRwDNY -4.1860 2.005 -2.088 0.037 -8.116 -0.256
source_key_LlT2YUhhzqhg5Sw -2.7406 1.786 -1.534 0.125 -6.242 0.761
source_key_NgDl19wMapZy17u 7.8276 1.989 3.935 0.000 3.929 11.726
source_key_wCURE6d3bPkepu2 7.1605 1.968 3.639 0.000 3.304 11.017
source_key_iCRJl6heRkivqQ3 6.9412 2.013 3.449 0.001 2.996 10.886
source_key_mqwcsP2rE7J0TFp 6.4711 1.998 3.239 0.001 2.555 10.387
source_key_xMbIugepa2P7lBB 6.4086 1.988 3.223 0.001 2.512 10.305
source_key_ZnxXDlPa8U1GXgE 5.7560 1.996 2.884 0.004 1.845 9.667
source_key_uHbuxQJl8lW7ozc 4.7270 1.977 2.391 0.017 0.853 8.601
source_key_zVJPv84UY57bAof 3.9630 2.011 1.971 0.049 0.021 7.905
==============================================================================
Omnibus: 6728.596 Durbin-Watson: 2.003
Prob(Omnibus): 0.000 Jarque-Bera (JB): 54661.487
Skew: -0.540 Prob(JB): 0.00
Kurtosis: 8.487 Cond. No. 3.86e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.86e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
##Plot the test vs predicted values for the OLS model
features_list = list(ols_model.params.index)
y_pred_ols = ols_model.predict(X_test[features_list])
#Determine the ideal line for the plot
x = np.linspace(0, max(y_pred_ols), 1000)
y = x
print(f'Rsquared Score: {r2_score(y_test, y_pred_ols)}')
plt.scatter(x = y_pred_ols, y = y_test, alpha = 0.1, label = 'Test Values')
plt.plot(x, y, color = 'red', label = 'Predicted = Test')
plt.title('Test vs Predicted Values', fontsize = 20)
plt.text(0, 0.8*max(y_pred_ols), f'R^2 = {round(r2_score(y_test, y_pred_ols),4)}', fontsize = 10)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.legend()
plt.savefig('images/ols_test_vs_predicted.png')
plt.show()
Rsquared Score: 0.9771782653041474
#Plot the test and predicted values vs irradiation
plot_actual_vs_predicted(ols_model, X_test, y_test, 'irradiation')
#Run the forward selection algorithm on the data with outliers
best_features_with_outliers,ols_model_outlier = forward_selection(X_train_outliers,y_train_outliers)
#Save the model
joblib.dump(ols_model_outlier, os.path.join(model_path, 'ols_model_with_outliers.pkl'))
##Print the information about the best features
print("Feature, p-value, adjusted R^2:\n")
for key, value in best_features_with_outliers.items():
print(key, value)
Feature, p-value, adjusted R^2:
irradiation (0.0, 0.7504714808699537)
source_key_Quc1TzYxW2pYoWX (9.196909038366164e-165, 0.7525191140745627)
module_temperature (2.1397293098366064e-163, 0.7545329558236192)
ambient_temperature (1.005677491826219e-308, 0.7583187007897205)
source_key_Et9kgGMDl729KT4 (1.685036492328268e-112, 0.7596660933981063)
source_key_LYwnQax7tkwH5Cb (1.090688806789156e-99, 0.7608507948870082)
source_key_rrq4fwE8jgrTyWY (5.520711067202813e-75, 0.7617317292535072)
source_key_q49J1IKaHRwDQnt (9.298262577357635e-52, 0.7623302816401254)
source_key_81aHJ1q11NBPMrL (7.050678460491541e-43, 0.7628209226154267)
source_key_xoJJ8DcxJEcupym (6.023719465786919e-36, 0.7632277073534982)
source_key_9kRcWv60rDACzjR (9.037981194855572e-37, 0.7636436251019996)
source_key_WcxssY2VbP4hApt (1.077632388684895e-35, 0.7640460001420118)
source_key_LlT2YUhhzqhg5Sw (5.276898748097564e-36, 0.764451380435982)
source_key_PeE6FRyGXUgsRhN (1.019680938742181e-34, 0.7648408090724541)
source_key_oZZkBaNadn6DNKz (6.059061669311676e-30, 0.7651730746734661)
source_key_V94E5Ben1TlhnDV (1.8910946454157478e-26, 0.7654635953886961)
source_key_vOuJvMaM2sgwLmb (1.664521682026185e-28, 0.7657780070004092)
source_key_oZ35aAeoifZaQzV (3.5990876183940565e-12, 0.7659002269146264)
source_key_4UPUqMRk7TRMgml (1.6927717361316809e-13, 0.7660378675856081)
day (7.415422052760818e-12, 0.766156300152868)
month (1.242773648938708e-20, 0.7663772805084007)
source_key_Qf4GUc1pJu5T6c6 (3.623839877541644e-10, 0.7664759498933165)
source_key_bvBOhCH3iADSZry (4.106704390653546e-09, 0.766562393982776)
source_key_Mx2yZCDsyf6DPfv (6.068253545667054e-08, 0.7666353449397258)
source_key_1BY6WEcLGh8j5v7 (1.0098847041676737e-07, 0.7667057356982433)
x_time_stamp (1.33332592887237e-06, 0.7667632955101)
source_key_NgDl19wMapZy17u (0.0028323055991187644, 0.766783647647396)
source_key_adLQvlD726eNBSB (0.004817629686932197, 0.766801514304112)
source_key_mqwcsP2rE7J0TFp (0.04692325115415324, 0.7668090965410564)
#Print a summary of the OLS model with outliers
ols_model_outlier = joblib.load(os.path.join(model_path, 'ols_model_with_outliers.pkl'))
print(ols_model_outlier.summary())
OLS Regression Results
==============================================================================
Dep. Variable: dc_power R-squared: 0.767
Model: OLS Adj. R-squared: 0.767
Method: Least Squares F-statistic: 1.029e+04
Date: Mon, 27 Feb 2023 Prob (F-statistic): 0.00
Time: 11:15:23 Log-Likelihood: -6.0335e+05
No. Observations: 90702 AIC: 1.207e+06
Df Residuals: 90672 BIC: 1.207e+06
Df Model: 29
Covariance Type: nonrobust
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
const -126.6742 12.992 -9.750 0.000 -152.138 -101.210
irradiation 764.4442 9.031 84.648 0.000 746.744 782.145
source_key_Quc1TzYxW2pYoWX -139.7220 3.902 -35.803 0.000 -147.371 -132.073
module_temperature 10.3801 0.317 32.771 0.000 9.759 11.001
ambient_temperature -6.0410 0.445 -13.577 0.000 -6.913 -5.169
source_key_Et9kgGMDl729KT4 -124.4040 3.933 -31.632 0.000 -132.112 -116.696
source_key_LYwnQax7tkwH5Cb -115.1993 3.892 -29.601 0.000 -122.827 -107.571
source_key_rrq4fwE8jgrTyWY -101.1837 3.869 -26.152 0.000 -108.767 -93.600
source_key_q49J1IKaHRwDQnt -87.2660 3.896 -22.399 0.000 -94.902 -79.630
source_key_81aHJ1q11NBPMrL -79.9898 3.902 -20.502 0.000 -87.637 -72.343
source_key_xoJJ8DcxJEcupym -73.2844 3.878 -18.899 0.000 -80.885 -65.684
source_key_9kRcWv60rDACzjR -71.7109 3.875 -18.505 0.000 -79.306 -64.115
source_key_WcxssY2VbP4hApt -69.2242 3.897 -17.761 0.000 -76.863 -61.585
source_key_LlT2YUhhzqhg5Sw -67.3550 3.916 -17.202 0.000 -75.029 -59.680
source_key_PeE6FRyGXUgsRhN -64.3249 3.917 -16.424 0.000 -72.001 -56.648
source_key_oZZkBaNadn6DNKz -58.1901 3.886 -14.973 0.000 -65.807 -50.573
source_key_V94E5Ben1TlhnDV -53.1145 3.883 -13.678 0.000 -60.726 -45.503
source_key_vOuJvMaM2sgwLmb -52.5439 3.889 -13.509 0.000 -60.167 -44.921
source_key_oZ35aAeoifZaQzV -35.4877 3.903 -9.093 0.000 -43.137 -27.838
source_key_4UPUqMRk7TRMgml -35.6424 3.943 -9.040 0.000 -43.370 -27.915
day 1.3797 0.126 10.968 0.000 1.133 1.626
month -20.4615 2.714 -7.540 0.000 -25.780 -15.143
source_key_Qf4GUc1pJu5T6c6 -30.5174 3.944 -7.737 0.000 -38.248 -22.787
source_key_bvBOhCH3iADSZry -29.5287 4.686 -6.302 0.000 -38.713 -20.344
source_key_Mx2yZCDsyf6DPfv -24.4758 3.917 -6.249 0.000 -32.153 -16.799
source_key_1BY6WEcLGh8j5v7 -25.0517 4.702 -5.328 0.000 -34.267 -15.836
x_time_stamp -0.5312 0.103 -5.150 0.000 -0.733 -0.329
source_key_NgDl19wMapZy17u -13.7397 4.521 -3.039 0.002 -22.601 -4.878
source_key_adLQvlD726eNBSB 13.0069 4.757 2.734 0.006 3.683 22.331
source_key_mqwcsP2rE7J0TFp -9.0325 4.546 -1.987 0.047 -17.942 -0.123
==============================================================================
Omnibus: 60996.691 Durbin-Watson: 1.998
Prob(Omnibus): 0.000 Jarque-Bera (JB): 840869.446
Skew: -3.131 Prob(JB): 0.00
Kurtosis: 16.538 Cond. No. 3.36e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.36e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
##Plot the test vs predicted values for the OLS model with outliers
features_list = list(ols_model_outlier.params.index)
y_pred_outliers = ols_model_outlier.predict(X_test_outliers[features_list])
#Determine the ideal line for the plot
x = np.linspace(0, max(y_pred_outliers), 1000)
y = x
print(f'Rsquared Score: {r2_score(y_test_outliers, y_pred_outliers)}')
plt.scatter(x = y_pred_outliers, y = y_test_outliers, alpha = 0.1, label = 'Test Values')
plt.plot(x, y, color = 'red', label = 'Predicted = Test')
plt.title('Test vs Predicted Values with Outliers', fontsize = 20)
plt.text(0, 0.8*max(y_pred_outliers), f'R^2 = {round(r2_score(y_test_outliers, y_pred_outliers),4)}', fontsize = 10)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.legend()
plt.savefig('images/ols_test_with_outliers_vs_predicted.png')
plt.show()
Rsquared Score: 0.7663317560994336
Because x_time_stamp is not linear with dc_power, I decided to add a polynomial feature to see if it would improve the model's performance. The nature of this relationship is shown in the line graph below (Mean DC Power vs Time Stamp). I used the code below to create the polynomial features, then ran the forward selection model to determine which features were significant and printed the summary.
Looking at the graphs of DC Power vs Time Stamp, it is clear that the relationship between the two variables is not linear. Iterating over polynomial degrees showed that a 6th-degree polynomial best fits the data: beyond degree 6, the p-values implied that the additional terms were not statistically significant. I will therefore transform x_time_stamp into a 6th-degree polynomial feature for the subsequent models (Polynomial Regression and Decision Trees).
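The `polynomial_features` helper used in the degree-iteration loop below is defined earlier in the notebook. As its body is not shown here, this is a sketch of the assumed behavior, expanding a feature column into polynomial terms with sklearn:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

def polynomial_features(X_train, X_test, degree):
    """Expand the feature column(s) into polynomial terms of the given degree.

    The bias column is included so the statsmodels OLS fit gets an intercept.
    The transformer is fit on the training split only.
    """
    poly = PolynomialFeatures(degree = degree, include_bias = True)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)
    return X_train_poly, X_test_poly
```

For a single column t and degree 2, this produces the columns [1, t, t^2]; because the input is already a DataFrame, the degree-1 case needs no special reshaping.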
#Plot the average values of the dc_power for each time stamp
data_dc_power_vs_x_time_stamp = data[['dc_power','x_time_stamp']]
df_mean = data_dc_power_vs_x_time_stamp.groupby('x_time_stamp').mean()
plt.plot(df_mean.index, df_mean['dc_power'])
plt.xlabel('Time Stamp')
plt.ylabel('Mean DC Power')
plt.title('Mean DC Power vs Time Stamp')
plt.show()
##Create a polynomial regression model to predict the dc_power based on the time stamp, using the mean dataframe to speed up the iteration process
from sklearn.preprocessing import PolynomialFeatures
##Split the mean dc_power per time stamp into train and test sets
X_train_timestamp, X_test_timestamp, y_train_timestamp, y_test_timestamp = train_test_split(df_mean.index, df_mean['dc_power'], test_size = 0.2, random_state = 42)
X_train_timestamp = pd.DataFrame(X_train_timestamp)
X_test_timestamp = pd.DataFrame(X_test_timestamp)
y_train_timestamp = pd.Series(y_train_timestamp)
y_test_timestamp = pd.Series(y_test_timestamp)
#Iterate through a range of degrees to get the fitted OLS model and plot the actual vs predicted values
for i in range(1, 11):
    print("Degree = ", i)
    X_train_poly, X_test_poly = polynomial_features(X_train_timestamp, X_test_timestamp, i)
    ols_model = sm.OLS(endog = y_train_timestamp, exog = X_train_poly)
    poly_model = ols_model.fit()
    print(poly_model.summary())
    #Plot the actual vs predicted values
    plt.figure()
    sns.scatterplot(x = X_test_timestamp.values.flatten(), y = y_test_timestamp.values)
    sns.scatterplot(x = X_test_timestamp.values.flatten(), y = poly_model.predict(X_test_poly), legend = True)
    plt.legend(['Actual','Predicted'])
    plt.xlabel('Time Stamp')
    plt.ylabel('DC Power')
    plt.title(f'DC Power vs Time Stamp (degree {i})')
    plt.show()
    print("\n")
Degree = 1
OLS Regression Results
==============================================================================
Dep. Variable: dc_power R-squared: 0.001
Model: OLS Adj. R-squared: -0.026
Method: Least Squares F-statistic: 0.02677
Date: Sat, 25 Feb 2023 Prob (F-statistic): 0.871
Time: 20:41:22 Log-Likelihood: -290.29
No. Observations: 40 AIC: 584.6
Df Residuals: 38 BIC: 588.0
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 595.2769 188.009 3.166 0.003 214.673 975.880
x1 -2.4183 14.780 -0.164 0.871 -32.339 27.502
==============================================================================
Omnibus: 17.505 Durbin-Watson: 2.246
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3.510
Skew: -0.215 Prob(JB): 0.173
Kurtosis: 1.614 Cond. No. 43.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Degree = 2
OLS Regression Results
==============================================================================
Dep. Variable: dc_power R-squared: 0.972
Model: OLS Adj. R-squared: 0.971
Method: Least Squares F-statistic: 642.6
Date: Sat, 25 Feb 2023 Prob (F-statistic): 1.85e-29
Time: 20:41:22 Log-Likelihood: -218.78
No. Observations: 40 AIC: 443.6
Df Residuals: 37 BIC: 448.6
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -3078.0426 107.345 -28.674 0.000 -3295.544 -2860.541
x1 660.3714 18.664 35.383 0.000 622.555 698.187
x2 -27.0657 0.755 -35.837 0.000 -28.596 -25.535
==============================================================================
Omnibus: 2.711 Durbin-Watson: 2.346
Prob(Omnibus): 0.258 Jarque-Bera (JB): 2.526
Skew: 0.577 Prob(JB): 0.283
Kurtosis: 2.570 Cond. No. 2.16e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Degree = 3
OLS Regression Results
==============================================================================
Dep. Variable: dc_power R-squared: 0.974
Model: OLS Adj. R-squared: 0.972
Method: Least Squares F-statistic: 447.4
Date: Sat, 25 Feb 2023 Prob (F-statistic): 1.55e-28
Time: 20:41:22 Log-Likelihood: -217.40
No. Observations: 40 AIC: 442.8
Df Residuals: 36 BIC: 449.6
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -3641.2284 366.933 -9.923 0.000 -4385.403 -2897.054
x1 816.6594 99.255 8.228 0.000 615.362 1017.957
x2 -40.5717 8.463 -4.794 0.000 -57.735 -23.408
x3 0.3673 0.229 1.602 0.118 -0.098 0.832
==============================================================================
Omnibus: 3.117 Durbin-Watson: 2.499
Prob(Omnibus): 0.210 Jarque-Bera (JB): 2.772
Skew: 0.634 Prob(JB): 0.250
Kurtosis: 2.770 Cond. No. 1.22e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.22e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Degree = 4
OLS Regression Results
==============================================================================
Dep. Variable: dc_power R-squared: 0.998
Model: OLS Adj. R-squared: 0.998
Method: Least Squares F-statistic: 4094.
Date: Sat, 25 Feb 2023 Prob (F-statistic): 3.33e-46
Time: 20:41:23 Log-Likelihood: -167.30
No. Observations: 40 AIC: 344.6
Df Residuals: 35 BIC: 353.0
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 3450.8830 372.908 9.254 0.000 2693.840 4207.926
x1 -1829.5530 136.430 -13.410 0.000 -2106.520 -1552.586
x2 311.7623 17.925 17.392 0.000 275.372 348.153
x3 -19.5585 1.006 -19.434 0.000 -21.602 -17.515
x4 0.4059 0.020 19.842 0.000 0.364 0.447
==============================================================================
Omnibus: 0.763 Durbin-Watson: 2.115
Prob(Omnibus): 0.683 Jarque-Bera (JB): 0.769
Skew: 0.102 Prob(JB): 0.681
Kurtosis: 2.352 Cond. No. 7.28e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.28e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Degree = 5
OLS Regression Results
==============================================================================
Dep. Variable: dc_power R-squared: 0.998
Model: OLS Adj. R-squared: 0.998
Method: Least Squares F-statistic: 3324.
Date: Sat, 25 Feb 2023 Prob (F-statistic): 1.09e-44
Time: 20:41:23 Log-Likelihood: -166.43
No. Observations: 40 AIC: 344.9
Df Residuals: 34 BIC: 355.0
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 4997.7971 1311.357 3.811 0.001 2332.798 7662.796
x1 -2561.5417 610.498 -4.196 0.000 -3802.223 -1320.861
x2 445.0596 109.854 4.051 0.000 221.810 668.309
x3 -31.2540 9.564 -3.268 0.002 -50.689 -11.818
x4 0.9016 0.404 2.234 0.032 0.081 1.722
x5 -0.0081 0.007 -1.230 0.227 -0.022 0.005
==============================================================================
Omnibus: 1.149 Durbin-Watson: 2.289
Prob(Omnibus): 0.563 Jarque-Bera (JB): 0.985
Skew: 0.154 Prob(JB): 0.611
Kurtosis: 2.295 Cond. No. 4.54e+08
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.54e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
Degree = 6
OLS Regression Results
==============================================================================
Dep. Variable: dc_power R-squared: 0.998
Model: OLS Adj. R-squared: 0.998
Method: Least Squares F-statistic: 3619.
Date: Sat, 25 Feb 2023 Prob (F-statistic): 4.96e-45
Time: 20:41:23 Log-Likelihood: -160.49
No. Observations: 40 AIC: 335.0
Df Residuals: 33 BIC: 346.8
Df Model: 6
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.877e+04 4236.018 4.431 0.000 1.01e+04 2.74e+04
x1 -1.035e+04 2366.819 -4.372 0.000 -1.52e+04 -5532.647
x2 2221.8395 534.855 4.154 0.000 1133.670 3310.009
x3 -240.9264 62.650 -3.846 0.001 -368.389 -113.463
x4 14.4202 4.019 3.588 0.001 6.244 22.596
x5 -0.4606 0.134 -3.435 0.002 -0.733 -0.188
x6 0.0062 0.002 3.377 0.002 0.002 0.010
==============================================================================
Omnibus: 5.205 Durbin-Watson: 2.269
Prob(Omnibus): 0.074 Jarque-Bera (JB): 3.934
Skew: 0.575 Prob(JB): 0.140
Kurtosis: 4.018 Cond. No. 3.01e+10
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.01e+10. This might indicate that there are
strong multicollinearity or other numerical problems.
Degree = 7
OLS Regression Results
==============================================================================
Dep. Variable: dc_power R-squared: 0.998
Model: OLS Adj. R-squared: 0.998
Method: Least Squares F-statistic: 3008.
Date: Sat, 25 Feb 2023 Prob (F-statistic): 3.14e-43
Time: 20:41:23 Log-Likelihood: -160.49
No. Observations: 40 AIC: 337.0
Df Residuals: 32 BIC: 350.5
Df Model: 7
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.869e+04 1.6e+04 1.166 0.252 -1.4e+04 5.13e+04
x1 -1.03e+04 1.05e+04 -0.984 0.332 -3.16e+04 1.1e+04
x2 2207.7714 2855.150 0.773 0.445 -3607.979 8023.521
x3 -238.8301 422.477 -0.565 0.576 -1099.388 621.728
x4 14.2374 36.648 0.388 0.700 -60.412 88.887
x5 -0.4512 1.866 -0.242 0.810 -4.252 3.349
x6 0.0059 0.052 0.114 0.910 -0.099 0.111
x7 3.019e-06 0.001 0.005 0.996 -0.001 0.001
==============================================================================
Omnibus: 5.212 Durbin-Watson: 2.269
Prob(Omnibus): 0.074 Jarque-Bera (JB): 3.941
Skew: 0.576 Prob(JB): 0.139
Kurtosis: 4.019 Cond. No. 2.05e+12
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.05e+12. This might indicate that there are
strong multicollinearity or other numerical problems.
Degree = 8
OLS Regression Results
==============================================================================
Dep. Variable: dc_power R-squared: 0.999
Model: OLS Adj. R-squared: 0.998
Method: Least Squares F-statistic: 2850.
Date: Sat, 25 Feb 2023 Prob (F-statistic): 3.22e-42
Time: 20:41:24 Log-Likelihood: -158.27
No. Observations: 40 AIC: 334.5
Df Residuals: 31 BIC: 349.7
Df Model: 8
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.147e+05 5.26e+04 2.179 0.037 7333.841 2.22e+05
x1 -8.288e+04 3.94e+04 -2.106 0.043 -1.63e+05 -2600.230
x2 2.568e+04 1.26e+04 2.037 0.050 -35.905 5.14e+04
x3 -4480.0453 2260.467 -1.982 0.056 -9090.297 130.207
x4 482.9732 248.272 1.945 0.061 -23.382 989.328
x5 -32.9187 17.117 -1.923 0.064 -67.829 1.992
x6 1.3835 0.724 1.911 0.065 -0.093 2.860
x7 -0.0328 0.017 -1.906 0.066 -0.068 0.002
x8 0.0003 0.000 1.907 0.066 -2.32e-05 0.001
==============================================================================
Omnibus: 2.751 Durbin-Watson: 2.267
Prob(Omnibus): 0.253 Jarque-Bera (JB): 1.627
Skew: 0.381 Prob(JB): 0.443
Kurtosis: 3.629 Cond. No. 1.30e+14
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.3e+14. This might indicate that there are
strong multicollinearity or other numerical problems.
Degree = 9
OLS Regression Results
==============================================================================
Dep. Variable: dc_power R-squared: 0.999
Model: OLS Adj. R-squared: 0.998
Method: Least Squares F-statistic: 2803.
Date: Sat, 25 Feb 2023 Prob (F-statistic): 4.16e-42
Time: 20:41:24 Log-Likelihood: -158.60
No. Observations: 40 AIC: 335.2
Df Residuals: 31 BIC: 350.4
Df Model: 8
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 8042.1814 3287.647 2.446 0.020 1336.980 1.47e+04
x1 6740.7749 2720.947 2.477 0.019 1191.368 1.23e+04
x2 -7118.8655 2994.572 -2.377 0.024 -1.32e+04 -1011.395
x3 2382.6959 1041.786 2.287 0.029 257.959 4507.433
x4 -422.3859 191.633 -2.204 0.035 -813.224 -31.548
x5 45.2197 21.158 2.137 0.041 2.067 88.372
x6 -3.0316 1.453 -2.087 0.045 -5.995 -0.068
x7 0.1248 0.061 2.049 0.049 0.001 0.249
x8 -0.0029 0.001 -2.023 0.052 -0.006 2.38e-05
x9 2.893e-05 1.44e-05 2.005 0.054 -4.99e-07 5.84e-05
==============================================================================
Omnibus: 2.655 Durbin-Watson: 2.363
Prob(Omnibus): 0.265 Jarque-Bera (JB): 1.547
Skew: 0.345 Prob(JB): 0.461
Kurtosis: 3.672 Cond. No. 9.02e+15
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.02e+15. This might indicate that there are
strong multicollinearity or other numerical problems.
Degree = 10
OLS Regression Results
==============================================================================
Dep. Variable: dc_power R-squared: 0.999
Model: OLS Adj. R-squared: 0.998
Method: Least Squares F-statistic: 2634.
Date: Sat, 25 Feb 2023 Prob (F-statistic): 1.09e-41
Time: 20:41:24 Log-Likelihood: -159.84
No. Observations: 40 AIC: 337.7
Df Residuals: 31 BIC: 352.9
Df Model: 8
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 381.0937 165.108 2.308 0.028 44.354 717.833
x1 737.6856 319.425 2.309 0.028 86.213 1389.158
x2 623.0014 269.229 2.314 0.027 73.906 1172.097
x3 -692.4313 301.175 -2.299 0.028 -1306.682 -78.181
x4 233.4806 104.022 2.245 0.032 21.326 445.635
x5 -41.0174 18.875 -2.173 0.038 -79.513 -2.522
x6 4.3110 2.050 2.103 0.044 0.130 8.492
x7 -0.2823 0.138 -2.041 0.050 -0.564 -0.000
x8 0.0113 0.006 1.989 0.056 -0.000 0.023
x9 -0.0003 0.000 -1.946 0.061 -0.001 1.22e-05
x10 2.491e-06 1.3e-06 1.913 0.065 -1.65e-07 5.15e-06
==============================================================================
Omnibus: 2.860 Durbin-Watson: 2.440
Prob(Omnibus): 0.239 Jarque-Bera (JB): 1.718
Skew: 0.384 Prob(JB): 0.424
Kurtosis: 3.664 Cond. No. 6.11e+17
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.11e+17. This might indicate that there are
strong multicollinearity or other numerical problems.
##Data with timestamp as a degree-6 polynomial feature
#The analysis above determined that a degree-6 polynomial was the best fit for the x_time_stamp feature
data_timestamp_6_degree_train = X_train.copy()
data_timestamp_6_degree_train['x_time_stamp2'] = data_timestamp_6_degree_train['x_time_stamp']**2
data_timestamp_6_degree_train['x_time_stamp3'] = data_timestamp_6_degree_train['x_time_stamp']**3
data_timestamp_6_degree_train['x_time_stamp4'] = data_timestamp_6_degree_train['x_time_stamp']**4
data_timestamp_6_degree_train['x_time_stamp5'] = data_timestamp_6_degree_train['x_time_stamp']**5
data_timestamp_6_degree_train['x_time_stamp6'] = data_timestamp_6_degree_train['x_time_stamp']**6
data_timestamp_6_degree_test = X_test.copy()
data_timestamp_6_degree_test['x_time_stamp2'] = data_timestamp_6_degree_test['x_time_stamp']**2
data_timestamp_6_degree_test['x_time_stamp3'] = data_timestamp_6_degree_test['x_time_stamp']**3
data_timestamp_6_degree_test['x_time_stamp4'] = data_timestamp_6_degree_test['x_time_stamp']**4
data_timestamp_6_degree_test['x_time_stamp5'] = data_timestamp_6_degree_test['x_time_stamp']**5
data_timestamp_6_degree_test['x_time_stamp6'] = data_timestamp_6_degree_test['x_time_stamp']**6
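The repeated power-column assignments above could also be written with a small helper. This is a sketch; `add_poly_features` is a hypothetical name, not part of the original notebook:

```python
import pandas as pd

def add_poly_features(df, column, degree):
    """Append column**2 .. column**degree as new columns named <column>2 .. <column><degree>."""
    out = df.copy()
    for d in range(2, degree + 1):
        out[f'{column}{d}'] = out[column] ** d
    return out

# Toy frame standing in for X_train
toy = pd.DataFrame({'x_time_stamp': [1.0, 2.0, 3.0]})
toy_poly = add_poly_features(toy, 'x_time_stamp', 6)
print(list(toy_poly.columns))
```

The same call would then be applied to both the train and test copies, keeping the two feature sets in sync.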
##Select best features from the data without outliers and the timestamp as a degree-6 polynomial feature
best_features_poly, ols_model_poly = forward_selection(data_timestamp_6_degree_train,y_train)
#Save the model
joblib.dump(ols_model_poly, os.path.join(model_path, 'ols_model_poly.pkl'))
##Print the information about the best features
print("Feature, p-value, adjusted R^2:\n")
for key, value in best_features_poly.items():
    print(key, value)
Feature, p-value, adjusted R^2:

irradiation (0.0, 0.9746659792698716)
source_key_Quc1TzYxW2pYoWX (0.0, 0.9757064068251208)
month (3.6498935499972564e-186, 0.9761912815681737)
ambient_temperature (6.062068956151132e-293, 0.9769381545533012)
x_time_stamp6 (8.16598404125645e-120, 0.9772334823798897)
x_time_stamp (0.0, 0.9789443424002355)
source_key_bvBOhCH3iADSZry (2.547232410595928e-144, 0.9792697455764533)
module_temperature (4.3618475736363384e-127, 0.9795515376640248)
source_key_Et9kgGMDl729KT4 (1.8576223704666486e-122, 0.979819269743865)
source_key_1BY6WEcLGh8j5v7 (1.0955736038224695e-105, 0.9800468679158433)
x_time_stamp2 (4.4978303191331947e-66, 0.9801861395165434)
source_key_rrq4fwE8jgrTyWY (5.319204921794335e-52, 0.9802941321907919)
source_key_LYwnQax7tkwH5Cb (1.509539148276051e-29, 0.9803534363823236)
source_key_oZ35aAeoifZaQzV (4.1557149952167194e-24, 0.9804009599953867)
source_key_4UPUqMRk7TRMgml (8.174038076903659e-26, 0.9804519989579374)
source_key_adLQvlD726eNBSB (1.218213929138763e-21, 0.9804940558240063)
source_key_Mx2yZCDsyf6DPfv (6.776246854931744e-21, 0.980534447461779)
source_key_Qf4GUc1pJu5T6c6 (6.884292708815206e-17, 0.9805663123555816)
source_key_ih0vzX44oOqAx2f (5.31621236258072e-16, 0.9805962611002599)
day (2.672359324203855e-11, 0.9806163340978019)
source_key_q49J1IKaHRwDQnt (1.3014923182458206e-10, 0.9806349566504476)
source_key_PeE6FRyGXUgsRhN (1.4983995752918998e-10, 0.9806534347050585)
source_key_1IF53ai7Xc0U56Y (5.175179985765482e-09, 0.9806687108653785)
source_key_IQ2d7wF4YD8zU1Q (1.0184552784229467e-08, 0.980683368754029)
source_key_VHMLBKoKgIrUVDU (1.89157980321678e-08, 0.9806974621137218)
source_key_McdE0feGgRqW7Ca (1.9080177898621578e-08, 0.9807115377969775)
source_key_WcxssY2VbP4hApt (6.557651526375273e-08, 0.9807245026170099)
source_key_oZZkBaNadn6DNKz (3.449956085073357e-08, 0.9807380311093687)
source_key_vOuJvMaM2sgwLmb (4.3358559052937316e-08, 0.9807513469133787)
source_key_3PZuoBAID5Wc2HD (4.2882899429946346e-08, 0.980764663641404)
source_key_V94E5Ben1TlhnDV (1.4595461398316023e-05, 0.9807728241810133)
source_key_zBIq5rxdHJRwDNY (0.00016686598622001766, 0.9807788642377404)
source_key_ZoEaEvLYb1n2sOq (9.291016480704167e-05, 0.9807854085282299)
source_key_NgDl19wMapZy17u (0.00360966113281367, 0.9807888323630309)
source_key_wCURE6d3bPkepu2 (0.006442311856343132, 0.980791775371832)
x_time_stamp4 (0.0075886637883365304, 0.980794583227)
source_key_iCRJl6heRkivqQ3 (0.01240801141691949, 0.9807969892292313)
source_key_xMbIugepa2P7lBB (0.009790226198015917, 0.9807995878594427)
source_key_mqwcsP2rE7J0TFp (0.005760723287928042, 0.980802621761431)
source_key_ZnxXDlPa8U1GXgE (0.005156218609413345, 0.9808057469137622)
source_key_81aHJ1q11NBPMrL (0.03227523591588995, 0.9808073880606634)
source_key_uHbuxQJl8lW7ozc (0.024284873504654768, 0.9808092536292817)
source_key_zVJPv84UY57bAof (0.01778835251491226, 0.9808113675271115)
#Print the model summary
ols_model_poly = joblib.load(os.path.join(model_path, 'ols_model_poly.pkl'))
print(ols_model_poly.summary())
OLS Regression Results
==============================================================================
Dep. Variable: dc_power R-squared: 0.981
Model: OLS Adj. R-squared: 0.981
Method: Least Squares F-statistic: 4.987e+04
Date: Mon, 27 Feb 2023 Prob (F-statistic): 0.00
Time: 11:11:08 Log-Likelihood: -2.2555e+05
No. Observations: 41956 AIC: 4.512e+05
Df Residuals: 41912 BIC: 4.516e+05
Df Model: 43
Covariance Type: nonrobust
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
const -412.4595 25.114 -16.423 0.000 -461.684 -363.235
irradiation 1277.5793 3.150 405.527 0.000 1271.404 1283.754
source_key_Quc1TzYxW2pYoWX -83.2154 1.794 -46.379 0.000 -86.732 -79.699
month 2.3740 1.148 2.067 0.039 0.123 4.625
ambient_temperature 4.2066 0.186 22.592 0.000 3.842 4.572
x_time_stamp6 -7.6e-06 1.6e-06 -4.761 0.000 -1.07e-05 -4.47e-06
x_time_stamp 65.9845 6.992 9.437 0.000 52.280 79.689
source_key_bvBOhCH3iADSZry -43.1227 1.902 -22.667 0.000 -46.852 -39.394
module_temperature -2.5276 0.109 -23.256 0.000 -2.741 -2.315
source_key_Et9kgGMDl729KT4 -35.8413 1.802 -19.893 0.000 -39.373 -32.310
source_key_1BY6WEcLGh8j5v7 -35.7267 1.898 -18.827 0.000 -39.446 -32.007
x_time_stamp2 -3.1129 0.573 -5.433 0.000 -4.236 -1.990
source_key_rrq4fwE8jgrTyWY -18.0256 1.737 -10.375 0.000 -21.431 -14.620
source_key_LYwnQax7tkwH5Cb -11.2032 1.768 -6.337 0.000 -14.668 -7.738
source_key_oZ35aAeoifZaQzV 23.7909 1.669 14.259 0.000 20.521 27.061
source_key_4UPUqMRk7TRMgml 23.6410 1.673 14.132 0.000 20.362 26.920
source_key_adLQvlD726eNBSB 21.9324 1.883 11.650 0.000 18.243 25.622
source_key_Mx2yZCDsyf6DPfv 20.7486 1.661 12.488 0.000 17.492 24.005
source_key_Qf4GUc1pJu5T6c6 18.4200 1.662 11.084 0.000 15.163 21.677
source_key_ih0vzX44oOqAx2f -10.6418 1.893 -5.623 0.000 -14.351 -6.932
day 0.3752 0.052 7.247 0.000 0.274 0.477
source_key_q49J1IKaHRwDQnt -4.8170 1.715 -2.809 0.005 -8.178 -1.456
source_key_PeE6FRyGXUgsRhN -3.9209 1.683 -2.330 0.020 -7.219 -0.623
source_key_1IF53ai7Xc0U56Y 15.1076 1.870 8.077 0.000 11.442 18.774
source_key_IQ2d7wF4YD8zU1Q 16.0741 1.852 8.677 0.000 12.443 19.705
source_key_VHMLBKoKgIrUVDU 14.1020 1.878 7.510 0.000 10.422 17.782
source_key_McdE0feGgRqW7Ca 13.6041 1.869 7.279 0.000 9.941 17.268
source_key_WcxssY2VbP4hApt 13.8989 1.747 7.958 0.000 10.475 17.322
source_key_oZZkBaNadn6DNKz 12.9318 1.666 7.763 0.000 9.667 16.197
source_key_vOuJvMaM2sgwLmb 12.3649 1.689 7.319 0.000 9.054 15.676
source_key_3PZuoBAID5Wc2HD 12.1898 1.886 6.464 0.000 8.493 15.886
source_key_V94E5Ben1TlhnDV 9.4891 1.665 5.698 0.000 6.225 12.753
source_key_zBIq5rxdHJRwDNY -4.8033 1.889 -2.543 0.011 -8.506 -1.101
source_key_ZoEaEvLYb1n2sOq -4.6365 1.875 -2.473 0.013 -8.311 -0.962
source_key_NgDl19wMapZy17u 7.9756 1.875 4.254 0.000 4.300 11.651
source_key_wCURE6d3bPkepu2 7.0741 1.854 3.816 0.000 3.441 10.707
x_time_stamp4 0.0036 0.001 2.736 0.006 0.001 0.006
source_key_iCRJl6heRkivqQ3 6.4968 1.896 3.426 0.001 2.780 10.213
source_key_xMbIugepa2P7lBB 6.5326 1.874 3.485 0.000 2.859 10.206
source_key_mqwcsP2rE7J0TFp 6.4065 1.883 3.402 0.001 2.715 10.098
source_key_ZnxXDlPa8U1GXgE 6.2115 1.880 3.304 0.001 2.527 9.896
source_key_81aHJ1q11NBPMrL 4.2934 1.757 2.444 0.015 0.850 7.736
source_key_uHbuxQJl8lW7ozc 4.5835 1.862 2.461 0.014 0.934 8.233
source_key_zVJPv84UY57bAof 4.4904 1.895 2.370 0.018 0.777 8.204
==============================================================================
Omnibus: 7941.838 Durbin-Watson: 2.003
Prob(Omnibus): 0.000 Jarque-Bera (JB): 77028.404
Skew: -0.632 Prob(JB): 0.00
Kurtosis: 9.517 Cond. No. 1.37e+09
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.37e+09. This might indicate that there are
strong multicollinearity or other numerical problems.
##Plot the test vs predicted values for the OLS model with polynomial features
features_list = list(ols_model_poly.params.index)
y_pred_poly = ols_model_poly.predict(data_timestamp_6_degree_test[features_list])
#Determine the ideal line for the plot
x = np.linspace(0, max(y_pred_poly), 1000)
y = x
print(f'Rsquared Score: {r2_score(y_test, y_pred_poly)}')
plt.scatter(x = y_pred_poly, y = y_test, alpha = 0.1, label = 'Test Values')
plt.plot(x, y, color = 'red', label = 'Predicted = Test')
plt.title('Test vs Predicted Values with Polynomial', fontsize = 20)
plt.text(0,0.8*max(y_pred_poly), f'Rsquared Score: {round(r2_score(y_test, y_pred_poly),4)}', fontsize = 10)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.legend()
plt.savefig('images/ols_test_with_feature_engineering_vs_predicted.png')
plt.show()
Rsquared Score: 0.979742205004171
As the predictions and test scores above show, the polynomial regression model performed better than the linear regression model. This is expected because the relationship between x_time_stamp and dc_power is not linear. The linear regression model without feature engineering had an r-squared of 0.978, while the polynomial regression model reached 0.981, showing that the feature engineering on the timestamp yields a small but consistent improvement in the model's performance.
However, the most significant improvement for these models came from removing the outliers that were caused by outside influences on grid production. This is evident in the r-squared values for the linear regression models: without the outliers, r-squared was 0.978; with the outliers, it was 0.767. The outliers were clearly having a substantial impact on the model's performance.
Since we learned from the linear regression models that the polynomial features for the timestamp improved performance, I will start with that dataset for the decision tree models.
In this section, I will use GridSearchCV for hyperparameter tuning on the random forest. Then I will choose the best parameters, run the model on the test data, and plot its performance. Cross-validation is beneficial for decision trees because these models have a tendency to over-fit: the trees are greedy and will fit the training data as closely as possible. Cross-validation helps ensure that the model generalizes rather than memorizes the data.
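The benefit of cross-validation for trees can be illustrated on synthetic data (a sketch; `X_demo`/`y_demo` are made up and not the project's dataset):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: a noisy sine curve
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.1, 200)

# An unconstrained tree can memorize the training folds; 5-fold CV scores
# reveal how well each configuration generalizes to held-out data
deep_scores = cross_val_score(DecisionTreeRegressor(random_state=42), X_demo, y_demo, cv=5)
shallow_scores = cross_val_score(DecisionTreeRegressor(max_depth=3, random_state=42), X_demo, y_demo, cv=5)
print(f'deep tree CV mean R^2: {deep_scores.mean():.3f}')
print(f'depth-3 tree CV mean R^2: {shallow_scores.mean():.3f}')
```

Comparing the two CV means shows whether extra tree depth actually helps on held-out folds, which is exactly the question GridSearchCV answers across a whole parameter grid.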
##Get data for the tree model
X_tree_train = data_timestamp_6_degree_train #Rename the data to be used for the tree model for clarity
X_tree_test = data_timestamp_6_degree_test #Rename the data to be used for the tree model for clarity
##Train a Random Forest Model
rfr = RandomForestRegressor(random_state = 42, n_estimators = 5, max_depth = 10, min_samples_split = 2, min_samples_leaf = 1, max_features = None, bootstrap = True)
model = rfr.fit(X_tree_train, y_train)
print(f'Model train score {model.score(X_tree_train, y_train)}')
print(f'Model test score {model.score(X_tree_test, y_test)}')
Model train score 0.9873066530206269
Model test score 0.9845227915368935
##Hyper parameter tuning for the Random Forest Model
#parameter_grid = {'n_estimators':[10, 15],'max_depth':[None, 5, 10],'min_samples_split':[5,10,20],'min_samples_leaf':[1,5,10],'max_features':[None,'sqrt','log2']}
#Best parameters: {'max_depth': None, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 20, 'n_estimators': 15}
#Best cross-validation score: 0.9876522410807599
parameter_grid = {'n_estimators':[150,300,500],'max_depth':[None],'min_samples_split':[20],'min_samples_leaf':[1,2],'max_features':['sqrt']}
#Best parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 20, 'n_estimators': 500}
#Best cross-validation score: 0.9868443354599717
rfr = RandomForestRegressor(random_state = 42,bootstrap=True)
grid_search = GridSearchCV(rfr, parameter_grid, cv = 5, n_jobs = -1, verbose = True)
grid_search.fit(X_tree_train, y_train)
model_rf = grid_search.best_estimator_
joblib.dump(model_rf, os.path.join(model_path, 'random_forest_model.pkl'))
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best cross-validation score: {(grid_search.best_score_)}')
print(f'Model train score {model_rf.score(X_tree_train, y_train)}')
print(f'Model test score {model_rf.score(X_tree_test, y_test)}')
Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 20, 'n_estimators': 500}
Best cross-validation score: 0.9868443354599717
Model train score 0.9910182888547747
Model test score 0.9868820302717891
As we can see above, a set of parameters that performed well was: max_depth = None, max_features = 'sqrt', min_samples_leaf = 1, min_samples_split = 20, n_estimators = 500.
These parameters provided a test score of 0.9869 and a train score of 0.9910, indicating that the model generalizes rather than over-fits. It also performs better than the polynomial regression model tested in the previous section.
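Beyond raw accuracy, a fitted forest also exposes `feature_importances_`, which is useful for the hardware-diagnosis goal (e.g., which inverter source keys matter). A minimal sketch on synthetic data follows; the feature names and values are illustrative, and the notebook's `model_rf` would be queried the same way:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: power driven mostly by irradiation, slightly by temperature
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({'irradiation': rng.uniform(0, 1, 500),
                       'ambient_temperature': rng.uniform(20, 40, 500),
                       'unrelated': rng.uniform(size=500)})
y_demo = 1000 * X_demo['irradiation'] + 4 * X_demo['ambient_temperature']

rf_demo = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)
# Importances sum to 1; sort descending to rank the features
ranked = sorted(zip(X_demo.columns, rf_demo.feature_importances_),
                key=lambda pair: -pair[1])
for name, importance in ranked:
    print(f'{name}: {importance:.3f}')
```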
#Plot the actual vs predicted values for the Random Forest Model
model_rf = joblib.load(os.path.join(model_path, 'random_forest_model.pkl'))
yhat_rf = model_rf.predict(X_tree_test)
print(f'Best Parameters: {grid_search.best_params_}')
print(f'R^2: {r2_score(y_test, yhat_rf)}')
#Determine the ideal line for the plot
x = np.linspace(0, max(yhat_rf), 1000)
y = x
print(f'Rsquared Score: {r2_score(y_test, yhat_rf)}')
plt.scatter(x = yhat_rf, y = y_test, alpha = 0.1, label = 'Test Values')
plt.plot(x, y, color = 'red', label = 'Predicted = Test')
plt.title('Test vs Predicted Values for Random Forest', fontsize = 20)
plt.text(0,0.8*max(yhat_rf), f'Rsquared Score: {round(r2_score(y_test, yhat_rf),4)}', fontsize = 10)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.legend()
plt.savefig('images/random_forest_test_vs_predicted.png')
plt.show()
Best Parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 20, 'n_estimators': 500}
R^2: 0.9868820302717891
Rsquared Score: 0.9868820302717891
For adaBoost regression, I started with hyperparameter tuning to determine the best parameters for the model, using the code below. I then used the best parameters to fit the final model and print the summary. Note: due to compute constraints, the grid search is broken up into several passes to narrow in on the best parameters; the history and results are kept as comments in the code.
#Use grid search cv to find the best parameters for an adaBoost regressor
adrf = AdaBoostRegressor(random_state = 42)
#parameter_grid = {'n_estimators':[10, 15],'learning_rate':[0.3, 0.5, 0.7],'loss':['linear', 'square', 'exponential']}
#Best parameters: {'learning_rate': 0.5, 'loss': 'exponential', 'n_estimators': 15}
#Best cross-validation score: 0.9734290715023818
#parameter_grid = {'n_estimators':[5,10,15],'learning_rate':[0.5],'loss':['exponential']}
#Best parameters: {'learning_rate': 0.5, 'loss': 'exponential', 'n_estimators': 15}
#Best cross-validation score: 0.9734290715023818
#parameter_grid = {'n_estimators':[5,10,15],'learning_rate':[0.4,0.5,0.6],'loss':['linear', 'square','exponential']}
#Best parameters: {'learning_rate': 0.6, 'loss': 'exponential', 'n_estimators': 15}
#Best cross-validation score: 0.9735801480586428
#parameter_grid = {'n_estimators':[10,15,20],'learning_rate':[0.6,0.7],'loss':['exponential']}
#Best parameters: {'learning_rate': 0.6, 'loss': 'exponential', 'n_estimators': 15}
#Best cross-validation score: 0.9735801480586426
parameter_grid = {'n_estimators':[15,100],'learning_rate':[0.6],'loss':['exponential']}
grid_search = GridSearchCV(adrf, parameter_grid, cv = 5, n_jobs = -1, verbose = True)
grid_search.fit(X_tree_train, y_train)
model_adrf = grid_search.best_estimator_
joblib.dump(model_adrf, os.path.join(model_path, 'adaboost_model.pkl'))
Fitting 5 folds for each of 2 candidates, totalling 10 fits
['models\\adaboost_model.pkl']
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best cross-validation score: {(grid_search.best_score_)}')
print(f'Model train score {model_adrf.score(X_tree_train, y_train)}')
print(f'Model test score {model_adrf.score(X_tree_test, y_test)}')
Best parameters: {'learning_rate': 0.6, 'loss': 'exponential', 'n_estimators': 15}
Best cross-validation score: 0.9735801480586426
Model train score 0.9739220876966134
Model test score 0.9733982845240751
As we can see above, a set of parameters that performed well was: learning_rate = 0.6, loss = 'exponential', n_estimators = 15.
These parameters provided a test score of 0.9734 and a train score of 0.9739, indicating that the model generalizes rather than over-fits.
As we can see in the plot, many predictions share the same value. This happens because the number of boosting iterations is small enough that the ensemble cannot predict at a fine granularity, which is why it performs worse than the random forest, whose output is averaged over a large number of deeper trees. During the grid search I allowed the model to use more estimators (up to 100); however, this caused the model to over-fit and perform worse on the validation set. Thus the best tested parameters were used in the final model, despite the appearance that it isn't fitting finely enough.
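This granularity effect can be reproduced on synthetic data (a sketch using scikit-learn's default shallow base trees; `X_demo`/`y_demo` are made up, not the project's data):

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

# Synthetic stand-in data: a noisy linear relationship
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = 2.0 * X_demo.ravel() + rng.normal(0, 0.5, 300)

# With only a few boosting rounds of shallow trees, the ensemble can emit
# only a limited set of distinct values, so predictions look "stepped"
ada = AdaBoostRegressor(n_estimators=15, random_state=0).fit(X_demo, y_demo)
n_distinct = np.unique(ada.predict(X_demo)).size
print(f'{n_distinct} distinct predictions for {len(X_demo)} samples')
```

Each sample's prediction is drawn from the leaf values of the shallow base trees, so the number of distinct outputs is bounded by trees × leaves, which is far fewer than the number of samples.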
#Plot the actual vs predicted values for the adaBoost Model
model_adrf = joblib.load(os.path.join(model_path, 'adaboost_model.pkl'))
yhat_adrf = model_adrf.predict(X_tree_test)
print(f'Best Parameters: {grid_search.best_params_}')
print(f'R^2: {r2_score(y_test, yhat_adrf)}')
#Determine the ideal line for the plot
x = np.linspace(0, max(yhat_adrf), 1000)
y = x
print(f'Rsquared Score: {r2_score(y_test, yhat_adrf)}')
plt.scatter(x = yhat_adrf, y = y_test, alpha = 0.1, label = 'Test Values')
plt.plot(x, y, color = 'red', label = 'Predicted = Test')
plt.title('Test vs Predicted Values for adaBoost', fontsize = 20)
plt.text(0,0.8*max(yhat_adrf), f'Rsquared Score: {round(r2_score(y_test, yhat_adrf),4)}', fontsize = 10)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.legend()
plt.savefig('images/adaBoost_test_vs_predicted.png')
plt.show()
Best Parameters: {'learning_rate': 0.6, 'loss': 'exponential', 'n_estimators': 15}
R^2: 0.9733982845240751
Rsquared Score: 0.9733982845240751
For the gradient boosting regression, I started with hyperparameter tuning to determine the best parameters for the model, using the code below. I then used the best parameters to fit the final model and print the summary. Note: due to compute constraints, the grid search is broken up into several passes to narrow in on the best parameters; the history and results are kept as comments in the code.
## Use grid search cv to find the best parameters for a gradient boosting regressor
xgb = XGBRegressor(random_state = 42)
'''
parameter_grid = {'n_estimators':[10, 15],
'learning_rate':[0.3, 0.5, 0.7],
'max_depth':[None,25],
'subsample':[0.5,0.7,1.0],
'reg_alpha':[0.5,1.0,1.5],
'reg_lambda':[0.5,1.0,1.5],
'gamma':[0.5,1.0,1.5],
'min_child_weight':[0.5,1.0,1.5]}
#Best parameters: {'gamma': 1.5, 'learning_rate': 0.3, 'max_depth': 25, 'min_child_weight': 1.5, 'n_estimators': 15, 'reg_alpha': 1.5, 'reg_lambda': 1.5, 'subsample': 1.0}
#Best cross-validation score: 0.9878195763482076
parameter_grid = {'n_estimators':[15,20],
'learning_rate':[0.3],
'max_depth':[None,25,15],
'subsample':[1.0],
'reg_alpha':[1.5],
'reg_lambda':[1.5],
'gamma':[1.5],
'min_child_weight':[0.5,1.5]}
#Best parameters: {'gamma': 1.5, 'learning_rate': 0.3, 'max_depth': 25, 'min_child_weight': 1.5, 'n_estimators': 15, 'reg_alpha': 1.5, 'reg_lambda': 1.5, 'subsample': 1.0}
#Best cross-validation score: 0.9878195763482076
parameter_grid = {'n_estimators':[10,15],
'learning_rate':[0.2,0.3],
'max_depth':[None,25,35],
'subsample':[0.9,1.0],
'reg_alpha':[1.5],
'reg_lambda':[1.5],
'gamma':[1.5],
'min_child_weight':[0.5,1.5]}
Best parameters: {'gamma': 1.5, 'learning_rate': 0.3, 'max_depth': 25, 'min_child_weight': 1.5, 'n_estimators': 15, 'reg_alpha': 1.5, 'reg_lambda': 1.5, 'subsample': 1.0}
Best cross-validation score: 0.9878195763482076
'''
parameter_grid = {'n_estimators':[15],
                  'learning_rate':[0.3],
                  'max_depth':[25],
                  'subsample':[1.0],
                  'reg_alpha':[1.25, 1.5, 1.75],
                  'reg_lambda':[1.25, 1.5, 1.75],
                  'gamma':[1.25, 1.5, 1.75],
                  'min_child_weight':[1.5]}
#Best parameters: {'gamma': 1.25, 'learning_rate': 0.3, 'max_depth': 25, 'min_child_weight': 1.5, 'n_estimators': 15, 'reg_alpha': 1.5, 'reg_lambda': 1.5, 'subsample': 1.0}
#Best cross-validation score: 0.9878324286112157
grid_search = GridSearchCV(xgb, parameter_grid, cv = 5, n_jobs = -1, verbose = True)
grid_search.fit(X_tree_train, y_train)
model_xgb = grid_search.best_estimator_
joblib.dump(model_xgb, os.path.join(model_path, 'xgboost_model.pkl'))
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best cross-validation score: {(grid_search.best_score_)}')
print(f'Model train score {model_xgb.score(X_tree_train, y_train)}')
print(f'Model test score {model_xgb.score(X_tree_test, y_test)}')
Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best parameters: {'gamma': 1.25, 'learning_rate': 0.3, 'max_depth': 25, 'min_child_weight': 1.5, 'n_estimators': 15, 'reg_alpha': 1.75, 'reg_lambda': 1.5, 'subsample': 1.0}
Best cross-validation score: 0.9877023981485994
Model train score 0.9963129067204253
Model test score 0.986910961768458
As we can see above, a set of parameters that performed well was: gamma = 1.25, learning_rate = 0.3, max_depth = 25, min_child_weight = 1.5, n_estimators = 15, reg_alpha = 1.75, reg_lambda = 1.5, subsample = 1.0.
These parameters provided a test score of 0.9869 and a train score of 0.9963. Although the train score is somewhat higher than the test score, the strong test performance indicates the model still generalizes well.
#Plot the actual vs predicted values for the Gradient Boost Model
model_xgb = joblib.load(os.path.join(model_path, 'xgboost_model.pkl'))
yhat_xgb = model_xgb.predict(X_tree_test)
print(f'Best Parameters: {grid_search.best_params_}')
print(f'R^2: {r2_score(y_test, yhat_xgb)}')
#Determine the ideal line for the plot
x = np.linspace(0, max(yhat_xgb), 1000)
y = x
print(f'Rsquared Score: {round(r2_score(y_test, yhat_xgb),4)}')
plt.scatter(x = yhat_xgb, y = y_test, alpha = 0.1, label = 'Test Values')
plt.plot(x, y, color = 'red', label = 'Predicted = Test')
plt.title('Test vs Predicted Values for Gradient Boost', fontsize = 20)
plt.text(0,0.8*max(yhat_xgb), f'Rsquared Score: {round(r2_score(y_test, yhat_xgb),4)}', fontsize = 10)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.legend()
plt.savefig('images/gradient_boost_test_vs_predicted.png')
plt.show()
Best Parameters: {'gamma': 1.25, 'learning_rate': 0.3, 'max_depth': 25, 'min_child_weight': 1.5, 'n_estimators': 15, 'reg_alpha': 1.75, 'reg_lambda': 1.5, 'subsample': 1.0}
R^2: 0.986910961768458
Rsquared Score: 0.9869
As the chart below shows, the decision tree models performed best compared to the linear/polynomial regression. To achieve this performance, I did hyperparameter tuning to avoid over-fitting the training data, in addition to feature engineering of the timestamp and removing the outliers.
Removing the outliers had the single greatest impact on all of the models. It is critical that enough time is spent understanding the data and ensuring it is clean, so that the model isn't attempting to learn from data that isn't representative of the model's goal.
At the end of each model's section, I plotted the actual vs. predicted values along with the ideal line. This is a great way to visualize model performance and to check whether the residuals are randomly distributed around that line. If they are not, the model is not performing well and the data needs to be understood further.
results_dict = {'Linear Regression without Outliers': r2_score(y_test, y_pred_ols),
                'Linear Regression with Outliers': r2_score(y_test_outliers, y_pred_outliers),
                'Polynomial Regression': r2_score(y_test, y_pred_poly),
                'Random Forest': r2_score(y_test, yhat_rf),
                'adaBoost': r2_score(y_test, yhat_adrf),
                'Gradient Boost': r2_score(y_test, yhat_xgb)}
#Create a bar chart
plt.figure(figsize = (10, 8))
plt.bar(results_dict.keys(), results_dict.values(), color = 'blue')
plt.title('R^2 Values for Each Model', fontsize = 20)
#Add the R^2 values to the chart
for i, v in enumerate(results_dict.values()):
    plt.text(i - 0.25, v + 0.001, str(round(v, 4)))
plt.ylim(0.7, 1)
plt.xticks(rotation = 90)
plt.xlabel('Model')
plt.ylabel('R^2 Value')
plt.show()
Since Random Forest and Gradient Boost both have essentially the same test score and ability to generalize, I will use prediction time as the tie-breaker for my production-model recommendation. The prediction times differ significantly between the two models; thus, I recommend the gradient boost model for production environments.
%%timeit
#Time the gradient boost's prediction
yhat_xgb = model_xgb.predict(X_tree_test)
43.2 ms ± 5.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
#time the random forest's prediction
yhat_rf = model_rf.predict(X_tree_test)
1.59 s ± 65.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
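The `%%timeit` magic above only works inside a notebook; in a plain script, the same comparison can be sketched with the standard-library `timeit` module (the model and data below are illustrative stand-ins, not the notebook's `model_rf`/`model_xgb`):

```python
import timeit

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in model and data
rng = np.random.default_rng(42)
X_demo = rng.uniform(size=(1000, 5))
y_demo = X_demo @ np.arange(1.0, 6.0)
model_demo = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Average wall time per predict() call over 10 repetitions
per_call = timeit.timeit(lambda: model_demo.predict(X_demo), number=10) / 10
print(f'{per_call * 1000:.1f} ms per predict call')
```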
#Summary of the test vs predictions of each model
import matplotlib.image as mpimg
fig, axes = plt.subplots(2,3, figsize = (20, 10))
fig.suptitle('Test vs Predicted Values for Each Model', fontsize = 20)
axes[0,0].imshow(mpimg.imread('images/ols_test_vs_predicted.png'))
axes[0,1].imshow(mpimg.imread('images/ols_test_with_outliers_vs_predicted.png'))
axes[0,2].imshow(mpimg.imread('images/ols_test_with_feature_engineering_vs_predicted.png'))
axes[1,0].imshow(mpimg.imread('images/random_forest_test_vs_predicted.png'))
axes[1,1].imshow(mpimg.imread('images/adaBoost_test_vs_predicted.png'))
axes[1,2].imshow(mpimg.imread('images/gradient_boost_test_vs_predicted.png'))
plt.show()
In the above plot, "Test vs Predicted Values for Each Model", we can see that most of the models performed similarly, with the exception of the model that included outliers and the adaBoost model.
The model that included outliers had many residuals that were not randomly distributed around the ideal line. This plot reinforces the information in the bar chart above showing the r-squared scores for each model.
A notable observation with the adaBoost model is that it cannot predict at a fine enough granularity to match the actual values. Although this doesn't reduce the r-squared score as much as the outliers did, it still reduces the model's performance.
I would choose the gradient boost model over the others because it does a good job of generalizing to the data, and when I timed each model's predict function, it was the fastest at 43 ms while performing just as well as the random forest, which took 1590 ms on the same dataset.
The primary challenge and takeaway that I had from this project was the importance of understanding the data and ensuring that it was clean and representative of the goals. In this case, the goal was to predict power generation using weather data, and an auxiliary goal was to understand hardware deterioration. During the course of this project, I went through several attempts to remove the outliers from the data before I settled on the methodology that I presented in this report.
Another challenge throughout this project was keeping the features consistent between the training and testing data. This was primarily an issue with PolynomialFeatures, because the output of that transform doesn't preserve the original feature names. I overcame this by converting the output back to a DataFrame and reassigning the column names to the transformed feature names.
I think the models I created, primarily the decision tree models, work well for their intended purpose. For future work, I would like to get more data, since this dataset only covers about a month (mid-May to mid-June). It would be interesting to incorporate seasonality into the data to account for variation in the irradiation values over the year. Overall, I think these models are a good starting point for predicting power generation and understanding hardware deterioration.